Thursday, August 9, 2018

Categories of Debt

In my last post I reviewed how semantic debt happens, and put up a picture of a data model staying static while a business model changed.  That's the core insight behind the concept of "semantic debt": an organization changes, but the data models that describe the organization don't keep pace.  At some point the organization spends a lot of time compensating for the drift between its data models and its business model.  That's semantic debt.

It costs money to deal with semantic debt, and that cost is largely an opportunity cost.  You might incur it because the org's reporting system is brittle, or needs a lot of machinery to compensate for the debt - perhaps a pile of Excel spreadsheets that must be merged into the system because the data model doesn't hold the data the spreadsheets do.  Or the cost may be incurred through a virtuous cycle of data governance and change management, where the debt is managed positively and proactively by people who can balance the need for change in the data models against the software release cycle.  Either way the debt needs to be managed; whether it's a hidden cost or an obvious one is a decision the organization makes.

If we know that semantic debt is inevitable, how do we evaluate where we're at? 

Each of those models, and the data in them, is a "semantic asset."  While it's certainly possible to look at an organization's semantic assets and declare by fiat that anything old is deep in semantic debt, it's a strange fact that sometimes the business just doesn't change.  Sometimes really old data models are still really good from a semantic standpoint; the organization still uses them productively.  And I'm not talking about a Type 4 or "epistemic bubble"-type issue, either, where the organization has limited its activities, consciously or unconsciously, so it doesn't run afoul of its data model.  Some organizations do just fine on old semantic assets.

(I think we'd all like to think, as data modelers, that we've designed a few of these: data models that are still valid for the business ten or twenty years on.  In practice of course this may be just wishful thinking.)

Broadly speaking there are two types of failure, or two ways a semantic asset might fail: Coverage and Equivalence.

Coverage failure occurs when our semantic assets don't describe as much of the organization as we think they do.  We run into this scenario all the time with organizations that have expanded - e.g. added new product lines - but continue to use an old data model.
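A minimal sketch of a coverage failure, using an invented order schema (all names here are hypothetical, not from any real system):

```python
from dataclasses import dataclass

# The original data model, built when the org only sold physical goods.
@dataclass
class Order:
    order_id: int
    sku: str
    quantity: int
    ship_to: str  # every order was assumed to ship somewhere

# The business has since added a subscription product line.  A subscription
# "order" has no quantity or shipping address, but it does have a billing
# period -- a field the old model has nowhere to put.
subscription = {"order_id": 1001, "sku": "SUB-PRO", "billing_period": "monthly"}

def covered(record: dict) -> bool:
    """Can the old model represent this record without losing fields?"""
    model_fields = set(Order.__dataclass_fields__)
    return set(record) <= model_fields

print(covered({"order_id": 1000, "sku": "WIDGET", "quantity": 3, "ship_to": "1 Main St"}))  # True
print(covered(subscription))  # False: 'billing_period' falls outside the model
```

The gap is exactly where the compensating spreadsheets come from: the `billing_period` data has to live somewhere, so it ends up outside the model.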


Coverage failures include Type 3 and 4 debt.  Type 3 debt involves scenarios with under-determined schemas or content.  Type 4 debt is "epistemic bubbles," or scenarios where the organization restricts its behavior to conform to the model and is thus blind to opportunities it might otherwise pursue.

Equivalence failure occurs when we have two semantic assets for the same organizational process, but they can't be mapped into each other.  Type 1 and 2 debt are Equivalence failures: Type 1 is "incompatible schemas or content," and Type 2 is "incompatible natural keys."
 

We often see Equivalence failure in orgs that have two applications for the same function, like email marketing.  In fact this problem is so common in email marketing I would wager that most MDM projects are designed to solve problems that arose because the org discovered it's got rampant equivalence failures between its two email marketing systems.
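A toy illustration of the Type 2 flavor of this failure, with invented records: two email marketing systems describe the same customers, but identify people by different natural keys.

```python
# Two email-marketing systems describing the same customers, but identifying
# people differently, so the "same" person can't be matched automatically.
system_a = {"C-0042": {"list": "newsletter"}}        # keyed by ERP customer id
system_b = {"pat@example.com": {"list": "promos"}}   # keyed by email address

# With no shared natural key, the only candidate join key is the key itself --
# and the two key spaces don't overlap at all.
shared_keys = set(system_a) & set(system_b)
print(shared_keys)  # set(): nothing to join on without a hand-built crosswalk
```

Every MDM crosswalk table is, in effect, a manual patch over exactly this empty intersection.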

For me, however, the paradigm case of Equivalence failure is the phenomenon of irreconcilable categorization schemes, an instance of Type 1 debt.  Remember this image from an earlier post, where I talked about ERP migrations and the difficulty of reconciling order-management status schemes?

It's a fair bet that any time you see a categorization scheme, some significant semantic debt has built up.  Let's think this through for a minute or two, because it's illustrative.  Note that we're not talking about incompatible data models here, but incompatible content schemes.  The data model for the two schemes might be identical pre- and post-change, but the way those schemes divide up the world - the way they "model" the world, to be more precise - can't be mapped from one to the other.
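Here's a small sketch of what "can't be mapped" means in practice, using two invented order-status schemes.  The schema (one string column) is identical on both sides; it's the content that won't reconcile:

```python
# Two order-status schemes for the same process.  The candidate mapping from
# the old scheme to the new one is one-to-many and overlapping, so historical
# statuses can't be restated in the new scheme.
old_to_new = {
    "open":    ["created", "picked"],     # old "open" spans two new states
    "shipped": ["shipped", "delivered"],  # old "shipped" spans two more
    "closed":  ["delivered", "returned"], # and overlaps with old "shipped"
}

# A translation is only lossless if every old value lands on exactly one
# new value.  This one doesn't.
lossless = all(len(targets) == 1 for targets in old_to_new.values())
print(lossless)  # False: pre-migration history gets an inflection point
```

When this check fails, reporting across the migration boundary requires either a year-zero inflection point or a judgment call about which bucket old records "really" belonged in.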

One reason categorization schemes rack up debt so quickly is that they're almost always inherently unstable.  Very few categorization schemes are immutable - as valid for me as they will be for my children, or were for my grandparents; that is, as the philosophers would say, very few categorization schemes are natural kinds.  Consider the category "race," a common personal attribute in American software applications.  Very few applications in the rest of the world even record "race" as a personal attribute, and those that do recognize that it's a constructed category and treat it as customizable - which means it can't be expected to provide consistent reporting.  Yet as little as a decade ago "race" was considered an immutable category in the US, and the composition of the "race" scheme a fixed and mutually-exclusive set of components.  That is clearly not the case now.  "Gender" is another category that until very recently was considered fixed and binary, with exactly two subcategories.  Now "gender" takes a number of values, and depending on the purpose of the application those subcategories may or may not be mutually exclusive.

It's certainly possible to engage in a long and voluble argument about how "political correctness" has changed these categories.  But the simple facts are that (a) in thousands of years of philosophical inquiry no one in any of the traditions - European, Chinese, Indian, Aztec, Plains Indian, to take just a few - has been able to propose more than a handful of logically-neat immutable categorization schemes, and "race" and "gender" aren't even on that list for the modern European tradition, never mind anywhere else; and (b) insulting your customers, or your customers' customers, is never the path to happiness and prosperity.  So it's best to bite the bullet and, whatever your personal preferences, learn to live with the mutability of categorization schemes.  Acknowledge that your scheme will incur semantic debt, and prepare for it.

When a categorization scheme changes, more often than not any reporting that uses the scheme must declare an inflection point - a year zero - that prevents temporal comparisons.  There are mitigation strategies that lessen the impact of these inflection points.  One of the simplest is to restrict the scope of change to subcategories: you partition the world into five subcategories instead of four, but you don't change the category those subcategories belong to.  You can't then compare pre- and post-change events at the subcategory level, but all of those events still fall into the same category.  You might try a semantic-equivalence change, too, although usually if your categories need to change it's because the old semantics doesn't work anymore.  A third option I've used is "managed change," where you phase changes to subcategories and categories in over time: one set of events can't be reported on because of an inflection point while the others can, and during the migration some local reporting fails temporal comparisons while other reporting stays stable.  Or you can just "bite the bullet," "rip the band-aid off," "dive into the deep end," etc., and make the change as global as possible.
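The subcategory-only strategy can be sketched in a few lines (all category names invented): the rollup mapping changes, but every subcategory, old or new, still lands in one of the same parent categories, so category-level totals stay comparable across the inflection point.

```python
# Before the change: four subcategories rolling up to two categories.
pre = {"phone": "hardware", "laptop": "hardware",
       "support": "services", "training": "services"}
# After: a fifth subcategory is added, but it rolls up to an existing
# category -- the parent scheme is untouched.
post = dict(pre, tablet="hardware")

def category_totals(events, rollup):
    """Count events at the category level, given a subcategory rollup."""
    totals = {}
    for sub in events:
        cat = rollup[sub]
        totals[cat] = totals.get(cat, 0) + 1
    return totals

# "tablet" didn't exist pre-change, so subcategory comparisons break --
# but category-level totals remain comparable across the boundary.
print(category_totals(["phone", "support"], pre))    # {'hardware': 1, 'services': 1}
print(category_totals(["tablet", "support"], post))  # {'hardware': 1, 'services': 1}
```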

Second, categorization schemes are often implemented in ways that make them difficult to change.  There's a longstanding application-development heuristic that says one shouldn't put categorization schemes into code, e.g. as enums.  You'd be surprised how often that heuristic is ignored, particularly now that everything goes into a NoSQL database and nobody believes in relational lookups.  Many developers think the heuristic exists to avoid situations where QA has to regression-test a change to the enums, triggered by something as simple as a change to the width of the strings.  But the heuristic is really driven by the fact that putting enums into code increases the volume of semantic debt, and incurs at least two types of it.  Changing the scheme in code requires at the very least a software release, and because that entire process may be opaque to the business users who rely on the categorization scheme, the scheme and the software often come to mean something very different to the users than to the developers and architects.  To avoid the release process, users may take to categorizing events outside the process flow dictated by the application - in Excel or a similar tool - simply to keep the business running.  Or development teams may de-prioritize enum changes in favor of sexier or more complex work, leaving the business team that needs the change to "make do" or restrict its processes until it can get the change made.
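The contrast is easy to see in miniature (a sketch, with an invented status scheme; the set here stands in for a relational lookup table):

```python
from enum import Enum

# Anti-pattern: the categorization scheme is baked into the code.  Adding a
# status means editing source, regression-testing, and shipping a release.
class OrderStatus(Enum):
    OPEN = "open"
    SHIPPED = "shipped"
    CLOSED = "closed"

# Alternative: the scheme lives in data.  Adding a status is a data change,
# visible to the business as soon as the row lands.
order_statuses = {"open", "shipped", "closed"}
order_statuses.add("backordered")  # no deploy required

print("backordered" in order_statuses)                    # True
print("backordered" in {s.value for s in OrderStatus})    # False: waits on a release
```

The gap between those two membership tests is exactly the window in which the Excel workarounds appear.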

In an earlier post I gave the example of an organization with a CRM application where the returnable product list is an enum.  (Again, some software developers may scoff at that example as a straw man: Surely no one would actually do something that stupid!  Uh, yeah.  A little more abstraction may reveal some interesting parallels with the scoffer's own situation.)  Products that aren't in the list simply can't be returned, so any disconnect between the process of updating the RMA-able product enum and the process of releasing products to customers will result in a bad customer experience and a lot of unhappy CS people.  I think the problem here goes deeper, though, than just the point of data entry.  In an architectural sense, treating the "product" (in this case) as simply a value in a variable in a formula for "how to return a product," and not as an object in its own right, means the process of return is also fixed: It becomes very difficult to modify the return process itself, architecturally, if all returns follow a fixed formula.  What if, for example, we want to manage two return processes?  We can't manage a single list of RMA-able products in a single enum; we need to manage two.  But since that's pretty difficult, we probably just restrict what we do to what the enum-based software can handle.  So putting enums into code is bad practice architecturally because it puts us in an epistemic bubble, restricting the options available to the organization - and that's Type 4 debt.
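The architectural point can be sketched as follows (field names are invented for illustration): when "returnable" is just membership in a hard-coded list, there is exactly one return process; when the product carries its own return policy, a second process is just another attribute value.

```python
# Enum-style: returnability is membership in a list, so there is exactly
# one return process, and a second one means restructuring the code.
RMA_ABLE = {"WIDGET", "GADGET"}

def can_return(sku: str) -> bool:
    return sku in RMA_ABLE

# Object-style: the product carries its own return policy, leaving room
# for more than one process -- or none.
products = {
    "WIDGET":  {"return_policy": "mail_in"},
    "GADGET":  {"return_policy": "in_store"},
    "SUB-PRO": {"return_policy": None},  # not returnable at all
}

def return_process(sku: str):
    return products[sku]["return_policy"]

print(can_return("WIDGET"), return_process("WIDGET"))  # True mail_in
print(return_process("GADGET"))                        # in_store
```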

So the four types of debt fall neatly into these two failure buckets.  To recap, the four types are:
  1. Debt that results from two (or more) incompatible but otherwise defined schemas or content.
    • For example, two email marketing systems that are organized differently.  Perhaps one is relational and the other is hierarchical.
  2. Debt that results from two incompatible natural keys.
    • For example, suppose one of our email marketing systems identifies individuals by their ERP-assigned customer key, and the other by their email address.
  3. Debt that results from under-determined schemas or content.
    • For example, file systems with poorly-documented or under-governed files generated by devices of various versions.
  4. Debt that results from epistemic bubbles.
    • As in scenarios where an organization restricts its business opportunities because its data model can't accommodate them.
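Type 3 is worth a sketch of its own, since "under-determined" can sound abstract.  Here's an invented example of the file-system scenario: the same logical reading arrives in several shapes depending on device firmware, and no documented schema says which fields to expect.

```python
# Type 3 debt in miniature: records generated by devices of various
# firmware versions, with no documented or governed schema.
rows = [
    {"ts": "2018-08-01", "temp": 21.5},                     # v1 devices
    {"timestamp": "2018-08-01", "temp_c": 21.5, "rh": 40},  # v2 devices
    {"t": 1533081600, "temperature": "21.5C"},              # v3 devices
]

# The schema is under-determined: no single field set covers the content,
# and nothing in the data says which shape is authoritative.
field_sets = {frozenset(row) for row in rows}
print(len(field_sets))  # 3 distinct shapes for one logical record
```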
Dropped into a matrix, our immutable and almost-certainly-a-natural-kind categorization scheme looks like this:

  Failure type  | Debt types
  --------------+------------------------------------------------------------
  Coverage      | Type 3 (under-determined schemas or content), Type 4 (epistemic bubbles)
  Equivalence   | Type 1 (incompatible schemas or content), Type 2 (incompatible natural keys)

(You can of course use "Type I" or "Type III" debt, if you think the Roman numerals look classier in your governance deck.)

In my next post we'll talk through how to assess the volume of debt incurred in each category and type of debt.




