The model I present below, concisely, is an argument of sorts, and some of the premises hint at mathematization. Mostly the premises are logical, chained together to form a theory. A theory can be thought of as a two-part thing: (1) a set of axioms, and (2) a model, a list or set of objects of which the axioms are true. Derivations of theorems from the axioms are then also true in the model.
I will assign each premise a number. Some premises are just glaringly obvious to people who do different things, but each step is critical to the theory. After the argument I'll provide some depth and motivation to the model.
1. All data collection by a software application or system has an intent, as represented by the sum total of its tickets, documents and, most importantly, its persisted records at a given point in time.
- For any organization there is more than one such system, and that superset of systems is called an ecosystem.
- The intent of each system is represented over time with its own distinct persistence models, which in aggregate form the system's semantics.
- This theory's model or domain of objects is this ecosystem of distinct semantics, and their contents.
2. All data collected by a system is persisted as one or more logically-contiguous records, stored together. One or more records may be used to represent the contents of a form, although not always or even often. A record-instance is necessarily composed of snapshots of different aspects of an instance of a state of affairs, and each record-instance is represented as true of a state-of-affairs-instance at a point in time. The individual snapshots (or "chunks of data") composed into a record are called attributes of the record or also fields. The particular order, type and intended content of a record's attributes is its data model, and the representation of all the records of a specific type - i.e. a specific collection of contiguous attributes, like an order - is called a table. The aggregate of all of a system's record types, i.e. the aggregate of their data models, is also called the system's data model.
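To make that vocabulary concrete, here is a minimal sketch in Python; the record type, attribute names and values are hypothetical, not drawn from any particular system:

```python
from dataclasses import dataclass
from datetime import date

# A hypothetical record type ("table") called ORDER; each field is an attribute,
# a snapshot of one aspect of a state of affairs.
@dataclass
class OrderRecord:
    order_id: int        # which order this record-instance is about
    customer_name: str   # one aspect of the state of affairs
    order_total: float   # another aspect
    recorded_at: date    # the point in time at which the record is asserted to be true

# A table is the collection of all record-instances of this type.
order_table: list[OrderRecord] = [
    OrderRecord(1001, "Dave", 49.95, date(2023, 4, 1)),
    OrderRecord(1002, "Alice", 12.50, date(2023, 4, 2)),
]

# The system's data model is the aggregate of all its record types' data models,
# e.g. ORDER plus CUSTOMER plus PRODUCT and so on.
```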
Any given composition of attributes may be a record, however, and no intent motivating the composition is documented in the table or the data model beyond what can be conveyed by the name-string associated with the table. The meaning of any individual attribute composing a record, including the timestamp, may be any of the following (sketched in code after the list):
- (1) Known to be clear, i.e. confirmed by documentation of intent
- (2) Known to be uncertain, i.e. ambiguous or vague as to intent
- (3) Known to have no meaning, i.e. identified as noise or NULL
- (4) Known to be in an uncertain state between (2) and (3)
- (5) Believed to be (1) but actually (2) or (3)
- (6) Believed to be (2) but actually (3), or the converse
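A minimal sketch of these six states as an enumeration; the names are my own, and the numbering matches the list above:

```python
from enum import Enum

class AttributeMeaning(Enum):
    KNOWN_CLEAR = 1          # (1) confirmed by documentation of intent
    KNOWN_UNCERTAIN = 2      # (2) ambiguous or vague as to intent
    KNOWN_MEANINGLESS = 3    # (3) identified as noise or NULL
    UNCERTAIN_BETWEEN_2_AND_3 = 4        # (4)
    BELIEVED_1_BUT_ACTUALLY_2_OR_3 = 5   # (5)
    BELIEVED_2_BUT_ACTUALLY_3_OR_CONVERSE = 6  # (6)
```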
3. All record attributes are organically (i.e. necessarily) divided into two overlapping kinds of thing (sketched in code after this list):
- An entity is a series of 1 or more attributes, persisted in one or more types of record, that represent a discrete instance of an object in the source application's data model. A record's attributes may represent multiple entities. An entity may be a concrete thing such as a human or a product, or it may be an abstract thing such as a file-storage location or a product hierarchy node.
- A natural key is a series of 1 or more attributes, persisted in the record, that can be used to identify an instance of an object in the source application's semantics. A record's attributes may represent more than one natural key. Logically a natural key is the identity condition for an entity instance.
- There are arguably other similar things in a record. Our Semantic Debt theory at this point doesn't take a position on them.
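A minimal sketch of the two kinds, using a hypothetical S1.CUSTOMER record; note that the natural key overlaps the entity's attributes:

```python
# Attributes of a hypothetical record type, S1.CUSTOMER.
customer_attributes = ["customer_id", "ssn", "first_name", "last_name", "dob", "branch_code"]

# The entity: the attributes that together represent a discrete object
# in the application's data model (here, the customer).
customer_entity = ["ssn", "first_name", "last_name", "dob"]

# The natural key: the attribute(s) that identify an instance of that object
# in the system's semantics; it overlaps with the entity's attributes.
customer_natural_key = ["ssn"]
```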
4. Suppose now there is a real-world entity, say a customer named Dave, who has entered his data into two different systems, S1 and S2. For any two distinct semantic systems in an ecosystem of semantic systems about the same real-world entities and natural keys, there is a cost that must be paid to integrate those two records. This is the semantic debt created by the need to integrate those two systems. The net integration cost across the ecosystem is the net semantic debt of the ecosystem.
- If key attributes in the ecosystem share meaning across each record instance at which they occur, i.e. they are standardized, and the attributes are otherwise isomorphic in data type and consistent in actual value, then the cost is 0.
- If they do not, then someone must make a decision as to which version of the truth is the correct one, and merge appropriately.
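A toy version of this decision, assuming per-attribute judgments that in practice come from documentation and human review rather than from code:

```python
def integration_cost(attr_a: dict, attr_b: dict) -> int:
    """Return 0 when two key attributes can be merged automatically,
    or 1 when a human must decide which version of the truth wins."""
    automatic = (
        attr_a["standardized"] and attr_b["standardized"]   # shared, documented meaning
        and attr_a["data_type"] == attr_b["data_type"]      # isomorphic data types
        and attr_a["value"] == attr_b["value"]               # consistent actual values
    )
    return 0 if automatic else 1

# Dave's date of birth as held by two hypothetical systems:
s1_dob = {"standardized": True, "data_type": "date", "value": "1970-01-01"}
s2_dob = {"standardized": True, "data_type": "date", "value": "1970-01-01"}
print(integration_cost(s1_dob, s2_dob))  # 0: records can be merged automatically
```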
5. There are only four kinds of integration decision that must be made to pay off semantic debt. These four kinds can be divided into two larger categories, Equivalence decisions and Coverage decisions.
6. Equivalence decisions must be made on mappings at the sub-record level between any two attributes or natural keys that purport to be about the same real-world entity or identity, in our case the aforementioned Dave.
7. Type I debt: For any two attributes representing the properties of an object in two systems in an ecosystem, where the object has an identity outside the ecosystem and/or is otherwise the identical object, e.g. two representations of a customer, those attributes can have the same meaning or not.
Consider the case of two systems S1 and S2, which have tables A1 and A2 that purport to be about customers. Before we can be sure that S1.A1.DATE_OF_BIRTH = S2.A2.DATE_OF_BIRTH, we must know what DATE_OF_BIRTH means in both S1.A1 and S2.A2. This case is straightforward if context pins down the meaning, e.g. when A1 = CUSTOMER and A2 = CUSTOMER. This is not so clear when e.g. A1 = CUSTOMER and A2 = HOUSEHOLD.
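A sketch of the Type I check, assuming a small catalogue of documented attribute meanings exists; the catalogue entries here are invented:

```python
# Hypothetical catalogue of documented intent per fully-qualified attribute.
attribute_meaning = {
    "S1.CUSTOMER.DATE_OF_BIRTH": "date of birth of the individual customer",
    "S2.CUSTOMER.DATE_OF_BIRTH": "date of birth of the individual customer",
    "S2.HOUSEHOLD.DATE_OF_BIRTH": "date of birth of the oldest member of the household",
}

def type_i_debt(attr_a: str, attr_b: str) -> bool:
    """True when mapping the two attributes requires a semantic decision."""
    meaning_a = attribute_meaning.get(attr_a)
    meaning_b = attribute_meaning.get(attr_b)
    return meaning_a is None or meaning_b is None or meaning_a != meaning_b

print(type_i_debt("S1.CUSTOMER.DATE_OF_BIRTH", "S2.CUSTOMER.DATE_OF_BIRTH"))   # False
print(type_i_debt("S1.CUSTOMER.DATE_OF_BIRTH", "S2.HOUSEHOLD.DATE_OF_BIRTH"))  # True
```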
8. Type II debt: For any two natural keys representing the identity of the same real-world entity, those natural keys can be semantically equivalent or not. If they are semantically equivalent, then the values of the attributes that make up those natural keys can be mapped automatically. If they are not, then the values of those attributes on which those natural keys occur must be prioritized or mapped with a semantic judgment.
Consider the case of two attributes chosen as natural keys in their respective systems, S1.CUSTOMER.SSN and S2.CUSTOMER.SIN. Both attributes serve as natural keys, but they are not semantically equivalent and would not be expected to have equivalent values for the same real-world individual, assuming SSN is a US Social Security Number and SIN is a Canadian Social Insurance Number. We may deduplicate records with 100% accuracy on either string but discover otherwise duplicate entries if an individual has one or both keys. Accommodation must then be made for Dave if he's a customer.
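A sketch of why exact deduplication on either key alone does not save us; Dave and the key values are invented:

```python
# Dave as he appears in two hypothetical systems, keyed on different natural keys.
s1_customers = [{"name": "Dave", "ssn": "123-45-6789"}]
s2_customers = [{"name": "Dave", "sin": "123-456-789"}]

# Deduplication within each system on its own key is exact...
assert len({c["ssn"] for c in s1_customers}) == len(s1_customers)
assert len({c["sin"] for c in s2_customers}) == len(s2_customers)

# ...but the keys are not semantically equivalent, so a naive union across
# systems keeps two records for one real-world individual. Reconciling them
# requires a semantic judgment (or a cross-reference), not a join on the keys.
merged = s1_customers + s2_customers
print(len(merged))  # 2
```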
9. Coverage debts are decisions that must be made on mappings at the entity or natural-key level between any two entities or natural keys that purport to be about the same real-world entity or identity. There are two types.
10. Type III debt: For any two records, every attribute will have a distinct known domain (i.e. be in state 2.1 above, known to be clear) or not. If not, then the mapping for that attribute will also have both first-order and second-order complications.
Consider the case of two attributes, S1.ORDER.DOB and S2.CUSTOMER.DOB. Assume we know that S2.CUSTOMER.DOB is intended to hold the S2 customer's date of birth. We wish to integrate the data in S1 but find no DATE_OF_BIRTH field or equivalent anywhere in S1. Our examination of S1.ORDER.DOB discovers mostly dates, with no apparent pattern, plus some strings. As one field is certain and the other is not, any integration would lower the data quality of both source fields. S1.ORDER.DOB is in Type III semantic debt until its intent can be determined.
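A sketch of the kind of profiling that surfaces Type III debt; the sample values pulled from S1.ORDER.DOB are invented:

```python
from datetime import datetime

sample = ["1970-01-01", "1985-06-30", "2001-11-05", "ASAP", "n/a"]  # hypothetical values

def looks_like_date(value: str) -> bool:
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

dates = [v for v in sample if looks_like_date(v)]
others = [v for v in sample if not looks_like_date(v)]
print(f"{len(dates)} date-like values, {len(others)} others: {others}")
# Mixed content and no documented intent: the attribute's domain is unknown,
# so mapping it onto S2.CUSTOMER.DOB would degrade both fields.
```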
11. Type IV debt: For any two records in systems S1 and S2, every meaningful attribute will either find a semantically-equivalent attribute it can be mapped to, or it won't. Attributes in S1 that can't be mapped to any attributes in S2 are said to be in an epistemic bubble.
Consider the case of an attribute, S1.ORDER.DOB, which is discovered to mean date of best estimate in S1. An examination of S2 reveals no equivalent attribute. S1.ORDER.DOB is now in Type IV semantic debt.
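A sketch of flagging Type IV debt, assuming we have already built a (hypothetical) mapping of S1 attributes to their S2 equivalents:

```python
# Hypothetical mapping of S1 attributes to semantically equivalent S2 attributes.
s1_to_s2 = {
    "S1.ORDER.ORDER_ID": "S2.ORDER.ORDER_NUMBER",
    "S1.ORDER.CUSTOMER_ID": "S2.CUSTOMER.CUSTOMER_ID",
    "S1.ORDER.DOB": None,  # "date of best estimate": nothing in S2 corresponds to it
}

# Attributes with no counterpart are in an epistemic bubble: meaningful in S1,
# invisible to the rest of the ecosystem.
type_iv = [attr for attr, target in s1_to_s2.items() if target is None]
print(type_iv)  # ['S1.ORDER.DOB']
```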
Now I want to motivate the Semantic Debt model with an inductive historical argument.
In the early course of my career (~1996-2005) all data was understood to be structured, by which is meant relational in the sense defined by Codd's relational algebra, a fragment and extension of set theory. That business processes are also primarily relational, in that they're primarily transactional, Codd would surely say was a coincidence. But the databases were all structured relationally, sort of like the business. Software applications, however, structured data as objects, and often more like documents, and these are primarily not relational. This created what was known as an impedance mismatch: a mapping negotiation between the way the database represented the software's objects and the way the software represented the database's relations and entities.
Since ~2005 or so, active resistance from Agile practitioners to coordinated cross-team data management has meant generations of software engineers whose only experience with data management is schemaless storage. These are intentionally non-relational forms of storage, designed to permit developers to change the semantics of their documents (e.g. forms) at the whim of the user. Eschewing good data management practices - because not just relational practices were ignored, but literally any technique that might carry downstream analytical commitments - permits a freedom and efficiency in software development that's been truly miraculous. But that has meant an explosion in the sources that must be integrated to get to a coherent picture of the digital ecosystem, as well as an explosion in the quantity of data created. A modern successful digital transformation has likely succeeded despite, and possibly because, it has practically ignored the rules of good data management, which prioritize integration. Most of the data produced as a result of the transformation - the implementation of the software - will be for all practical purposes unusable and unintegrated.
What do I mean by integration? Any company that wants to compete properly will have to collect and manage data about orders, products, customers, customer service calls, deliveries and returns, and marketing and sales, at minimum. They may also have to worry about facilities, etc. Each of these functions is supported by one or more software applications, a distinct semantic source. Each source has customer data because a customer may want to change their data, and so they do. And so each of those sources must now integrate its picture of the world with every other source or repository of customer data, both to ensure they have the latest but also to mark the change and learn from it what the customer wants the company to learn.
A coherent picture of an organization's customers logically requires (1) the superset of customer records from each application used across the org, and (2) that superset must be mapped internally, system to system, to maintain coherence. Customer data is scattered across many distinct sources. Each source presents a different semantics for customer. Each semantics must be reconciled into a coherent picture. This job is usually what's done by data modelers, of which I am one. That job is more often described as data analyst, which is often perceived pejoratively by professional software people.
By 2015 the number of possible sources across an org had proliferated. Each developer could now create a fully-functioning application collecting data, in a separate bubble all its own, with no standards. The removal of all data modeling considerations from most Agile teams meant multiple versions of the customer, changed at will. And so teams within organizations reacted like divide-and-conquer siblings, adopting polar-opposite naming conventions and hierarchies just because. Even a light hand at "data governance" required a full-time data modeler willing to offer almost chaplain-like services: gentle, patient and largely-ignored guidance across a dozen or more teams with potential document stores.
Some larger orgs were able to tie model changes together and maintain consistent analytics, or alternatively "proofs of quantitative metrics." But for many companies the discovery that all these newly-productive software teams were spinning out data that followed no consistent semantic scheme, that couldn't be integrated - well, it didn't lead them to stop the teams. Just stop the integration projects.
The velocity of data collection was prioritized over integration, on the theory that at some point some Big Neural Network would be smart enough to infer structure, on tap. While it's possible that day has arrived with chained LLMs, I don't hold out a lot of hope that my day job has finally finished, because the problem has increased in scope and complexity.
The scale of dis-integration is enormous across our current digital ecosystem. As I began working as a consultant and thinking about scaling across teams, it became clear to me that this problem of distinct semantic sources is everywhere, in organizations large and small. In general, organizations with a core digital competency have executed on that competency well. But as a result the vast amount of data collected by any one application is used for nothing but the bare minimum required to justify the annual expenditure of the license fee. In rare other cases the core digital competency is enormously profitable, and the data there is collected, used and integrated with discipline. But anywhere else data integration competency is considered a miracle.
The scale of this dis-integration, then, is vast because software is everywhere. Networks that can talk to each other are rare, and that is often on purpose. "Exclusive" and "proprietary" are built into the ethos of companies, some the largest in the world, who attract baristas and full professors as well as software engineers and graphic designers to their walled-garden experiences. Exclusivity builds in a cost to integration from the very beginning, even as it promises ease of further integration within the garden. That is precisely the point of both the marketing and the architecture. Outside the walled garden of an Apple, say, the rest is chaos.
So we now survey a landscape where a giant online product company with, say, exactly 100 properties has 100+ versions of its customer, 100+ order management systems, 100+ product catalogs, 100+ marketing campaign systems, and so on. How to establish order on that? It may not be possible. In some cases that is by design, to segregate customer data from big brother, and completely necessary. But can it still continue in every domain at that scale and, if not, why not?
That is all by way of motivation for the aforementioned model. Some of the rules that make up the model may in fact simply be well-known mathematical consequences I'm just not aware of. But this model is the structure of our problem.