Monday, August 27, 2018

Assessing Semantic Debt

Introduction

That last post was a long one.  But I think it was valuable, in that we've now got a pretty comprehensive explanation of how semantic debt comes about.  It is, as I've repeated almost ad nauseam, an inevitable consequence of the difference between what an org's data models describe and what the actual business model(s) do.  We think we can put all of our data into a database for reporting and various kinds of automated action, but then it turns out we can't.  So we cobble stuff together, make assumptions and guesses, and skip doing things we want to do because we can't store the data.

In a future post I'll talk more about the philosophical underpinnings of semantic debt.  That will almost certainly be more fun for me than anyone else, and I'm sure it will get almost no readers.  But today we need to talk through how to assess and evaluate semantic debt.

When we talk about "assessing" a problem we want to classify it, and so what we need is a better taxonomy.  We've got the beginnings of one already: there are our four types of debt, which break into two categories:

[Diagram: the four types of semantic debt, grouped into two categories]
This is helpful, but it's not complete.  For one thing, we're missing an X axis on the matrix, if only for the sake of symmetry.  (Symmetry is such a valuable architectural heuristic it seems a shame not to make use of it.)  So let's work through the various ways a semantic asset might fail.  In an earlier piece I called these failure points, and for now that seems as good a term as any.  An asset's failure point is the way in which an asset might develop one or more Types of debt.  I mean "develop" here in the sense that the asset's relationship to the business changes, and so it fails.  But more precisely, because it's the business model that changes, an asset's failure point is the way the delta between the business model and the data model becomes obvious to the organization.

In that earlier piece I identified three failure points: Immediate, Sudden and Gradual.  The easiest way to explain failure points is visually.

Immediate failure

An Immediate failure point occurs when a semantic asset covers some portion of an organization but not all the relevant portions it needs to cover, and so the asset is in debt from the beginning of its life:

[Diagram: Immediate failure, in which the data model covers only part of the business model from t1 onward]
The data model at t1 covers exactly 4/5ths of the business model at t1, and stays the same throughout time.  The data model's missing piece is the light green entity in the bottom right of the business model, and that entity's relationships with two other entities.  In this case the semantic asset starts out in debt, at inception.

Now, there's a sense in which you could say every semantic asset is in some kind of Immediate debt.  No data model perfectly covers everything it could cover, in the limiting case.  This might be because of lack of foresight, or Gödel's incompleteness theorems, or because the Universe is analog and models are digital, or whatever.  But no model is perfect, and so isn't it the case that all models are in Immediate debt?

But that kind of inexactness is not what I'm getting at here.  Data model construction is bounded by time and effort, and it's simply and sadly never the case that we're allowed to do data modeling to perfection.  We always run out of time.  Which is OK; there's no point in making the perfect the enemy of the good.  What an Immediate failure of a semantic asset indicates is a scoping decision: a decision, conscious or not, to leave certain entities out of the data model.  There are all kinds of reasons why we might leave entities out of a data model, some good and some bad.  Obviously, conscious awareness of the scope of our data model with respect to the business domain it's intended to cover is better than being unconscious of Immediate failures.  But even when we're conscious we've left X, Y and Z out of the model, we may be unconscious that we've left out S, T and U; and if S, T and U are relevant, the model is in Immediate failure.

Let's take two concrete examples.  The most common example of Immediate failure is a decision to include certain business processes in a data model and not others.  Suppose we're building a data model for "genome analysis."  "Genome analysis" is a big picture, though, so we have to start thinking about what specifically we want to store; we decide we want the exome, for example, or we really want to start analyzing variants.  Or we want to think big, so we pick everything, but we don't include specific types of everything, like satellite DNA, RNA or methylated DNA.  Now we might hope that our decision to "model DNA sequences" would let us more or less easily expand our base data model from an exome focus (or a methylation focus, or whatever) to everything else; after all, how different can methylated DNA, or the DNA underlying the proteins in the exome, be from everything else?  It's all just base pairs, right?  But when we do our modeling exercise we quickly discover that (a) the SMEs are not yet clear on the logical relationships between all the stuff we surface in the exercise, because the science still isn't all that clear, and (b) there are attributes, category hierarchies and relationships that hold for one kind of DNA that aren't present or useful for others.

This is a really typical problem in any modeling exercise: an order is an order is an order, in a more mundane context, unless the order is for Walmart or Target or Costco, in which case there are really at least four kinds of orders.  It's almost as if the exome is a different animal from the Whole Genome, a different entity, even though it appears to have the same fundamental substrate.  SMEs may talk about whole-part relationships in ways that make it appear the exome is a literal subset of the Whole Genome, and in certain practical ways that might be true.  But actual analysis shows the two entities' attributes are not in a subset relationship; they're largely different sets.  Which is to say, the attributes that make exome structures meaningful are not shared with other types of structures.  That's kind of obvious when you think about it, but it still means we're not just modeling "DNA sequences"; we're modeling different animals.  (I'm not arguing they are different animals, from an ER standpoint, just that they're talked about differently by the SMEs.)  And so while (b) there may be an underlying and fundamental attribute set you could ultimately reduce both sets of genome data to, it's (a) a lot of work to get your SMEs to clarify how all of that fits together.  There may be some long-term advantage to doing a classic reductive modeling exercise, the kind of thing that looks a bit like Category theory, that would allow the resultant model to accommodate all the various structures covered by the phrase "DNA sequence": Whole, Exome, Methylated, whatever.  But more than likely there will not be time to do a reduction, and the process will generate some unwelcome conflict with SMEs, who may feel the exercise unnecessarily questions their knowledge, the science as a whole, or their own often-considerable modeling skills.
So a conscious decision will need to be made to create a model that covers some portion of the possible domain and not others.  That model will be in Immediate debt, because it will not be fundamental enough to capture other parts of the domain, or at least not without some unknown amount of modification.  That's a conscious decision to create Immediate debt, and it's the virtuous cycle, or at least as virtuous as such a cycle can get after all the conflict.
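The point about attribute sets can be made concrete.  Here's a minimal sketch (the entity and attribute names are invented for illustration, not taken from any real genomics schema) of two "DNA sequence" entities that share a substrate but almost none of the attributes that make them meaningful:

```python
from dataclasses import dataclass, fields

@dataclass
class ExomeVariant:
    gene_symbol: str        # exome analysis is organized around genes and proteins
    transcript_id: str
    protein_change: str
    zygosity: str

@dataclass
class MethylationSite:
    chromosome: str         # methylation data is organized around CpG sites
    position: int
    beta_value: float       # fraction of cells methylated at this site
    assay_platform: str

# The "shared substrate" intuition suggests one entity is a subset of the
# other, but their attribute sets barely overlap:
exome_attrs = {f.name for f in fields(ExomeVariant)}
meth_attrs = {f.name for f in fields(MethylationSite)}
print(exome_attrs & meth_attrs)   # empty: no shared modeling attributes
```

A model scoped to one of these entities can't absorb the other without real modeling work, which is exactly the conscious Immediate debt described above.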

On the other, unconscious side of Immediate debt, we often run into scenarios where we're hoist by our own unstated assumptions.  Many of the worst examples of Type 4 debt, Epistemic bubbles, manifest as Immediate debt.  Our paradigm example at the link is the order-management schema that doesn't include the order status CANCELLED, which means an org with that content schema can't manage order cancellations.  This might appear to be a Potemkin case, but it's not hard to think of organizations that haven't thought through how to manage ex-customers; just think of all the social media brands, for example, that are surprised to discover former customers want nothing to do with them upon cancellation, and can only manage a "soft delete" of those customers' graphs.  These are organizations that have discovered they have Immediate debt, instead of consciously choosing to create it the way our genome-analysis organization did.
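The CANCELLED example fits in a few lines.  This is a sketch, not any real order-management system; the status names are invented, with CANCELLED deliberately left out to show the bubble:

```python
from enum import Enum

# Hypothetical order-status content schema, as shipped: nobody thought to
# include CANCELLED, so the model is in unconscious Immediate debt.
class OrderStatus(Enum):
    PLACED = "PLACED"
    PAID = "PAID"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"

def set_status(order: dict, status_name: str) -> dict:
    # The model can only record statuses it knows about.
    order["status"] = OrderStatus[status_name]
    return order

order = {"id": 42, "status": OrderStatus.PLACED}
set_status(order, "SHIPPED")        # fine: the schema anticipated this
try:
    set_status(order, "CANCELLED")  # the business event the model can't express
except KeyError:
    print("no way to record a cancellation")
```

The business clearly has cancellations; the data model simply has no word for them, which is what makes this an Immediate failure rather than a bug.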

So an Immediate failure is one in which the data model does not match the business model from inception.  It's better all round when an Immediate failure is a conscious and documented scoping decision.

Sudden failure

A Sudden failure occurs when a data model that had previously been in sync with its business model suddenly no longer is:

[Diagram: Sudden failure, in which the data and business models are in sync until t3, when the business model changes]
In the diagram above, the data model and business model are in sync until t3, at which point the business model changes dramatically.  Entities that had been in the business model have been replaced by new entities and relationships, and the business model is no longer reflected in the data model.

Sudden failure may also be conscious or unconscious.  An organization may buy an application to manage a relationship, unaware that the new application needs to be integrated into the data warehouse to support reporting.  This scenario happens so often, and with such hilarious impacts on projects, that it probably needs no further elaboration.  But just in case: any time there's a new communication process with customers that might have two or more outcomes, some integration with the rest of the org's reporting system needs to be done before the results of that process can be reported.  It's very often the case that go-getter product managers, for example, will hire a SaaS application to collect data on a specific customer interaction, only to discover much later that (a) the data collected by the application hasn't been made available to the DW people for integration, and often (b) there's no automated way to extract the data from the SaaS application so that it can be integrated.  (It's never been clear to me why someone would build a SaaS application that restricts its reporting to manual intervention, via an interface only a developer's mother would claim to love, but this happens so often I must be missing something.)  In such a case the product manager has unwittingly created a Sudden failure: where the DW had previously been able to report on acknowledged processes, the new process creates semantic assets that are inaccessible to the reporting system.

In such cases Sudden failure may be recognized for what it is, and handled with the kind of jury-rigged, scotch-tape-and-spit approach that can survive for years as long as no one makes any further changes (or the SaaS product doesn't go out of business).  Most of the high-drama cases of Sudden failure you find in the wild are likely these kinds of new-process or new-product introductions, where there's an assumption that communication has happened when it hasn't, or that the DW or Integration team will be able to pull a rabbit out of a hat.
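The gap the go-getter product manager creates is, at bottom, a set difference: fields the new tool exports versus columns the warehouse actually integrates.  A rough sketch, with all field names invented for illustration:

```python
# What the (hypothetical) SaaS survey tool can export:
saas_export_fields = {"customer_id", "survey_score", "survey_channel", "responded_at"}

# What the warehouse's customer-interaction model actually covers:
warehouse_columns = {"customer_id", "order_id", "order_total", "order_date"}

# Everything in the export that reporting can't see is new semantic debt.
unintegrated = saas_export_fields - warehouse_columns
if unintegrated:
    print(f"Sudden failure: {len(unintegrated)} fields invisible to reporting")
```

Running this kind of comparison when the tool is purchased, rather than when the first report is requested, is the difference between conscious and unconscious Sudden debt.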

Sudden failures also occur often in the ongoing process of software development.  Consider the extremely prevalent practice in software teams (teams without strong training in data governance) of tacking fields onto the right side of a database table to store additional attributes, without thought for how those fields might impact the grain of the table, or, again, whether the DW team has been informed the fields exist.  This happens more often with developers who've rejected relational databases for the perceived speed and efficiency of NoSQL systems, but the thinking is pretty much the same in either case.  First, the cardinality of those new fields may require some kind of normalization, even in a document-oriented model, that argues for a new table, object or whatever the local "container" concept is.  Without that normalization, queries against the table to collect the new data, the old data, or the new data in combination with the old become significantly more complex.  Old queries may require a whole lot of revision, to the point where something downstream breaks and the relevant VP suddenly becomes aware of the concept of "technical debt."  More often the new fields are simply not exposed to the Integration team, and while downstream users believe the data should be available, they can't see it.  Integrating new fields into the DW requires an impact analysis.  Sometimes the new fields can be tacked on as easily at the target as they were at the source, but if the target is a multi-terabyte fact table (or its equivalent in Hadoop) then the refresh process can be onerous and lengthy.
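Here's what the grain problem looks like in miniature.  The rows and field names are invented: a repeating "promo_code" tacked onto an orders table silently changes the grain from one-row-per-order to one-row-per-order-per-promo, and a simple uniqueness check catches it:

```python
# Orders table after the new field was bolted on:
rows = [
    {"order_id": 1, "total": 100.0, "promo_code": "SPRING"},
    {"order_id": 1, "total": 100.0, "promo_code": "VIP"},   # same order, second row
    {"order_id": 2, "total": 55.0,  "promo_code": None},
]

def grain_is(rows, key):
    """True if `key` uniquely identifies each row, i.e. the declared grain holds."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

# False: any old query that sums `total` by order now double-counts order 1.
print(grain_is(rows, "order_id"))
```

The normalization the paragraph above calls for would move promo codes into their own table keyed by order, restoring the original grain.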

In either example, of course, we might intentionally incur a semantic debt knowing full well that we need to fix it later, or at the very least know that we've incurred it.  It may be that in a virtuous case of Sudden failure the go-getter product manager has consulted with everyone in the information supply chain, from SaaS provider to report developer, identified what assets need to be modified or created, and a determination has been made as to what can be done, when, and by whom.  I have seen a few such instances of planning, and they're exhilarating when they occur.  Like any other form of positive reinforcement, they tend to create further virtuous cycles, as people wonder why they don't do the right thing all the time.  When planning doesn't occur, it becomes clear that at the very least there's a break in the process, and at worst that the organization is OK with process sleights-of-hand in order to call something "done."

Gradual failure

Gradual failure almost seems like a version of Sudden failure.  In Gradual failure the drift between the business model and the data model is small and imperceptible over time, but at some point becomes significant and noticeable:

[Diagram: Gradual failure, in which the data model drifts away from the business model a little at a time]
(This is my graphic from the post "Semantic debt over time."  Neat, huh?)

Now if we were doing philosophy instead of data management we might spend a lot of time arguing over whether Gradual failure was just a type of Sudden failure, but with a longer timeline.  We could even probably argue over whether Sudden failure was just some kind of unconscious Immediate failure.  And if you look at the distance between t2 and t3, for example, as being made up of a bunch of smaller timeslices, then the gradual drift between business model and data model described by Gradual failure really looks a lot like the sudden break we see in Sudden failure.

But we're not doing philosophy, we're doing data management.  It's true that at some level of abstraction all model failures are Sudden, because they're unexpected.  Similarly, all model failures are Immediate at some level of abstraction, because (to get all Existential) the model contains the seeds of its own failure, as no model is perfect.

But removing the color is unhelpful.  In day-to-day practice it's more useful to see which of our current models are in danger of which failure points.  Some models we know to be in Immediate failure.  Other models we suspect might be subject to Sudden failure; perhaps some of our content schemas, like our order-status or product-type schemas, look like they won't survive our organization's wild-west-style new-product introductions.  And other models are doing just fine but will eventually be subject to Gradual failure.

For example: lots and lots of organizations survive on old versions of enterprise software products.  This could be their ERP version, which might be in danger of obsolescence, or their digital analytics tools or ESP.  A thoughtfully managed asset like an ERP version might continue to be useful for end-users and customers alike for a very long time; or a data warehouse might survive for decades because it does exactly what its users expect.  But with software there are patches, or small restrictions in the initial package that get removed in later versions, or new features or expanded types of reporting or collection available in a new version that aren't in the old, that make the connections within an enterprise suite work better.  In the DW there are changes to COAs, or acquisitions that never really got integrated properly and so carry some overhead whose details someone needs to remember.

Consider for example a digital analytics suite where the newest version allows a machine-learning toolkit to access its raw data in real time.  Older versions of the suite can't use the ML toolkit because they don't have the ability to let the toolkit "look under the hood."  The new version permits a secure connection, so the ML happens seamlessly, as part of regular operations.  In that situation you'd want the newer version if you wanted the ML, and while maybe you could hack some kind of API into place, it's never quite the same thing as a native connection.  But you can rarely just do an in-situ upgrade if the old software is really old compared to the new.  The new version has a new model, and probably new languages and new ways of organizing events as well.  At some point the Gradual failure has gotten to the point where the old version is simply unrecognizable to the new version, and needs to be replaced.  This may happen before the product is no longer supported, or long after, but it's likely to happen.

The decade-old DW is the same: the model works for a smaller and smaller portion of the business.  While the part of the org the DW works for may have a lot of power with respect to other departments, the value of the DW for the org as a whole has diminished, and a new set of models needs to be built.  In the DW scenario we often see clear Type 4 debt problems, in that the gradually-diminishing value of the data model has led us to restrict the kinds of things we do in the org, so we can at least report on them.

Gradual failure is thus distinct from the other two, in that each individual lapse may not be vitiating.  It's the cumulative impact of little losses and drifts over time that makes the model fail.  In Immediate failure we know there's a problem right at the beginning of the model's life.  In Sudden failure we find a problem that hadn't previously been noticed.  In Gradual failure none of the individual gaps between the business model and the data model is significant enough to warrant pulling the alarm, but in the aggregate we have a problem.
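The distinction between the three failure points can be made mechanical if you imagine tracking, per period, what fraction of the business model the data model covers.  This is a rough sketch; the 0.2 "sudden drop" and 0.8 "accumulated drift" thresholds are invented for illustration, not part of the framework:

```python
def classify_failure(coverage):
    """coverage: list of floats in [0, 1], one per time period."""
    if coverage[0] < 1.0:
        return "Immediate"                    # in debt from inception
    drops = [a - b for a, b in zip(coverage, coverage[1:])]
    if any(d > 0.2 for d in drops):
        return "Sudden"                       # one large break between periods
    if coverage[-1] < 0.8:
        return "Gradual"                      # small drifts that add up
    return "None (for now)"

print(classify_failure([0.8, 0.8, 0.8]))               # Immediate
print(classify_failure([1.0, 1.0, 0.6, 0.6]))          # Sudden
print(classify_failure([1.0, 0.95, 0.9, 0.85, 0.75]))  # Gradual
```

The Gradual case is the interesting one: no single period-over-period drop trips the alarm, but the endpoint is well below where the model started.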

Assessment Matrix


The three failure points let us start to classify semantic assets.  Here's what our assessment matrix looks like:

[Table: assessment matrix, with failure points set against the types of semantic debt]
For each asset or group of related assets, we can decide (a) whether it has one or more types of debt, and (b) how severe each failure would be, on a scale of 1 (least) to 3 (worst).
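As a data structure, the matrix is just a mapping from assets to their failure points and severities.  A minimal sketch, with the assets and scores invented for illustration:

```python
# asset -> {failure point: severity 1 (least) to 3 (worst)}
matrix = {
    "order-status schema": {"Immediate": 3, "Sudden": 2},
    "genome data model":   {"Immediate": 2, "Gradual": 1},
    "decade-old DW":       {"Gradual": 3},
}

def worst_assets(matrix, threshold=3):
    """Assets with at least one failure at or above the severity threshold."""
    return sorted(asset for asset, scores in matrix.items()
                  if any(s >= threshold for s in scores.values()))

print(worst_assets(matrix))  # the assets to triage first
```

Even this crude version gives you a triage order, which is the whole point of assessing the debt rather than just lamenting it.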

I'll put a little more detail into this assessment matrix in my next post.


