Many of my readers work in environments where costs are studiously and purposefully ignored - we're in a post-accountability moment in many organizations right now, where bad design is assumed to be something that gets fixed after the IPO and/or when the adults get hired. But I think there's still a lot of utility in figuring out how to measure the value of good design in data management. Most of us don't work for Unicorns, and if you're doing data management you've got an interest in a reliable system: first so you don't have to wake up at 2AM to fix problems that would have been spotted in the daylight, and second because actual people actually depend on your systems. This is true whether you're curing cancer or shipping kombucha starter kits, or even curing cancer by shipping kombucha starter kits.
This is the Ideal Case for the use of a semantic asset:
Repeatable: The asset can be reproduced, even if there are a lot of steps involved or its composition appears to vary in quality from one period to the next.
Meaningful: The asset has some value to the target. It may or may not be Actionable!!1!, but there's someone somewhere who's waiting to consume it. That is, the target knows what the pieces of the asset mean - whether those pieces are structured hierarchically, relationally or whatever. And the target doesn't have to guess at that meaning each time s/he receives the asset, because the structure is stable. An unreliable semantic asset is its own kind of hell, but not one we'll deal with today.
"Meaningful" doesn't mean the asset is simple, and there can be all kinds of rules about dealing with variations in structure, or what variations mean. But for the asset to be meaningful to the target the target needs to know how to figure out the meaning. If there are suddenly fields tacked on to the asset that the target hasn't seen before then they literally don't know what they're for. And if fields are renamed or replaced and the target isn't made aware of the change then once again they literally don't know what they're for. This might seem like ETL 101, but you'd be surprised how often even supposedly experienced developers don't get this.
In practice this is a matter of communication. Seems pretty simple, right? When you change your outputs you should tell the people who use them, and when you get inputs you're obligated to use but can't you should tell the people who produce them. I once worked for a Supply Chain VP who asked each unit to map their "inputs" and "outputs" in front of the actual people who were their inputs and outputs; these very smart people, used to thinking longitudinally about their responsibilities, suddenly understood the actual scope of their decisions. BI/DW teams are usually much more aware of their information supply chains, simply because they're much shorter - the teams are usually so small that there's at most two people between a source and the executive using the report. But even when there's only one person to talk to about the structure and management of a particular semantic asset, things still get screwed up.
In practice this is also a matter of degree. If you're moving JSON from some kind of document store to a relational db, then the entire process has to be automated, with as much pre-formatted input as possible, or you'll never make it work. Heroism isn't even an option. If you're sending a spreadsheet to someone in Finance in response to an ad-hoc request, and you've worked with the recipient a lot, then automation and change management are unnecessary.
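As a hedged illustration of what "automated with pre-formatted input" looks like, assuming Postgres JSONB and hypothetical raw_events and orders tables:

```sql
-- Hypothetical staging table for documents as they arrive:
-- CREATE TABLE raw_events (id bigserial PRIMARY KEY, doc jsonb);

-- Promote only documents with the expected shape; everything else stays
-- in staging for a human to inspect. Heroism not required.
INSERT INTO orders (order_id, customer_id, order_total)
SELECT (doc->>'order_id')::bigint,
       (doc->>'customer_id')::bigint,
       (doc->>'total')::numeric
FROM raw_events
WHERE doc ?& array['order_id', 'customer_id', 'total'];
```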
This is the Actual Case:
It's a little more complicated. The target gets an asset that isn't what s/he needs. It may differ from the ideal asset in four ways. It may be less reliable than s/he needs - it's not as repeatable as spec'd, for example, because sometimes it arrives at 6AM and sometimes at 9AM, but it's a critical input into a process that starts at 7AM. It may not be as meaningful as necessary, either: the asset is missing rows or fields, or sometimes a whole presentation layer seems to have been forgotten, or sometimes a field that's supposed to mean one thing means something else entirely.
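Repeatability failures like that are easy to measure if delivery times are logged. A minimal sketch, assuming a hypothetical load_log table that records each arrival:

```sql
-- Hypothetical delivery log:
-- CREATE TABLE load_log (asset_name text, arrived_at timestamptz);

-- Days in the last quarter when the asset missed the 7AM process start.
SELECT date_trunc('day', arrived_at) AS delivery_day,
       min(arrived_at)               AS first_arrival
FROM load_log
WHERE asset_name = 'nightly_extract'
  AND arrived_at > now() - interval '90 days'
GROUP BY 1
HAVING min(arrived_at)::time > time '07:00'
ORDER BY 1;
```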
So how do we close the gap between What she needs and What she gets? It's not that hard to figure out how much it would cost to close the gap. We can itemize the tasks and get estimates on analysis, ETL and visualization from our BI/DW team. That tells us what our estimate is, and there aren't many modern BI teams that don't have a fair degree of facility with that kind of estimation. In fact some teams have gigantic lists of these things, carefully sorted and prioritized. But that list, with its associated estimates, is the dollar value of the organization's semantic debt.
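If the backlog is already itemized, pricing the debt is one query. A sketch, with a hypothetical debt_backlog table and an assumed blended labor rate:

```sql
-- Hypothetical backlog: one row per gap between what's delivered and
-- what's needed, with the team's estimate to close it.
-- CREATE TABLE debt_backlog (item text, debt_type int, est_hours numeric);

SELECT debt_type,
       sum(est_hours)          AS hours_to_close,
       sum(est_hours) * 150.00 AS est_cost_usd  -- assumed $150/hr blended rate
FROM debt_backlog
GROUP BY ROLLUP (debt_type)  -- the NULL row is the total semantic debt
ORDER BY debt_type;
```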
A couple of points. First, you might wonder which category of debt is more expensive. There's Equivalence debt and Coverage debt, as you may remember.
Equivalence debt occurs when you've got two assets that need to be merged. It may be, for example, that What she gets is a report about "Bounced emails by campaign" from the old ESP, and What she needs is a report that merges the Bounced emails reports from both the old and the new ESPs. The asset might then be in both Type 1 and Type 2 debt; the former if the schemas are different between the two sources, which is virtually certain, and the latter if there are two sets of natural keys between the two as well.
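In SQL terms the fix usually looks something like the sketch below. The two source tables and the crosswalk are hypothetical, but the shape is typical: a Type 1 repair maps the two schemas onto one, and a Type 2 repair resolves the two sets of natural keys:

```sql
-- Hypothetical sources with different schemas (Type 1 debt) and
-- different natural keys for the same campaigns (Type 2 debt):
-- old_esp_bounces (campaign_code text, bounced int)
-- new_esp_bounces (campaign_id text, bounce_count int)
-- campaign_xwalk  (old_code text, new_id text, campaign_name text)

WITH unified AS (
  -- Type 1: map both schemas onto one shape.
  SELECT campaign_code AS source_key, bounced AS bounces, 'old' AS esp
  FROM old_esp_bounces
  UNION ALL
  SELECT campaign_id, bounce_count, 'new'
  FROM new_esp_bounces
)
-- Type 2: resolve both key systems to one campaign via the crosswalk.
SELECT x.campaign_name, sum(u.bounces) AS total_bounces
FROM unified u
JOIN campaign_xwalk x
  ON (u.esp = 'old' AND u.source_key = x.old_code)
  OR (u.esp = 'new' AND u.source_key = x.new_id)
GROUP BY x.campaign_name;
```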
Coverage debt occurs when What she gets lacks the data necessary for the operation it's used for. Consider an asset consisting of the analysis of log data, fed from a device that seems to be functioning correctly. Producing the analysis requires a series of positive steps: sum these fields, count distinct values in those, transform that field using this xref so it shows these categories, and so on. Those analytical steps will have some defensive code built in, if they're built by anyone with any experience at all, so e.g. NULLs will get tossed from sums or counts or even xref matches. All of those rules add up to a model of the log data, a set of assumptions about how the data behaves that help transform the data into something meaningful at the target. We have coverage debt when there's stuff happening in the log data that isn't accounted for in the model. It might be as simple as the sudden appearance of negative values in a field where the "sum" operation doesn't expect them, leading to lower-than-expected sums that make the summary suspect. Or it might be that values start showing up in category fields that aren't in the xref, and so the asset breaks mysteriously or ignores those values.
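Here's what that "model of the log data" might look like as a query - hypothetical tables, but every clause is one of those defensive assumptions, and coverage debt is whatever the clauses silently drop:

```sql
-- Hypothetical log and xref:
-- device_log    (device_id text, reading numeric, category_code text)
-- category_xref (category_code text, category_name text)

SELECT x.category_name,
       sum(l.reading)   AS total_reading,   -- sum() quietly ignores NULLs
       count(l.reading) AS reading_count    -- so does count(col)
FROM device_log l
JOIN category_xref x                        -- inner join drops any code
  ON x.category_code = l.category_code      -- that isn't in the xref
WHERE l.reading >= 0                        -- assumes readings are never negative
GROUP BY x.category_name;
```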
So which is more expensive? It depends on the state of your BI team. I know "it depends" is the most useless phrase in modern English, but here's how I think it works. Equivalence debt is easy enough to fix with an experienced developer, especially someone who's got a lot of experience with SQL. If you've got someone who knows their way around the two sources (or even just one of them) then they can generally work out identity conditions (for Type 2) or schema-equivalence queries (for Type 1) pretty quickly. Sometimes of course the two sources can't be reconciled, and so everyone chases down a rathole for six months doing something like "the Philosophy of Order Management." (See the beginning of this post for an example.) But you can, as a BI Manager, probably figure out roughly how long it would take your junior people to resolve a case of Type 1 debt vs. asking the more senior people to do so.
Coverage debt is harder to resolve because it's a different kind of uncertainty. In the example we used above, of the log data shoved into a summarizing report, you could easily produce an exception report that shows the data ignored by the first summary - e.g. strings in the source that don't match strings in the xref. How expensive would such a report be? Probably not that much, and it would help you avoid all three Failure modes. That is, there's no immediate failure, because you've covered the requirement and built an open-ended asset for everything outside it. There's no sudden failure, because the open-ended asset will tell you if something happens in the log data that you didn't expect. And there's no gradual failure, at least not if you stay on top of folding the unexpected results shown in the exception report back into the main summarization.
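The exception report is just the complement of the model above - a sketch, again in Postgres-flavored SQL with the same hypothetical tables:

```sql
-- Everything the summary's model silently ignored.
SELECT l.category_code,
       count(*)                              AS events,
       count(*) FILTER (WHERE l.reading < 0) AS negative_readings
FROM device_log l
LEFT JOIN category_xref x
  ON x.category_code = l.category_code
WHERE x.category_code IS NULL   -- codes the xref doesn't know about
   OR l.reading < 0             -- values the sum doesn't expect
   OR l.reading IS NULL         -- rows the count quietly drops
GROUP BY l.category_code
ORDER BY events DESC;
```

If this report comes back empty, the model covers the source; anything that shows up here is coverage debt you can see before it turns into a sudden or gradual failure.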
But most coverage debt isn't so easy. What happens if there's no easy way to accommodate unexpected results? Maybe that's because there isn't anyone who knows what they mean - i.e. you get values in the log that may or may not be by design, but since no one knows the design, no one knows what they mean. I suspect all device-generated data has this problem eventually; not necessarily because there are new error codes or easter eggs, but because time has a way of making us all operate outside our operating parameters.
In those cases, the cost is much more open-ended. You can hire a data architect to design you a customer-event reporting system, for example, but if there are complex or unknown event types in your organization the process is going to be something of a fishing expedition. Many "Customer 360" programs falter when they start to look at the details of customer behavior, for example. You know they bought something and you know they bought something again, but all the interactions you've got with them between those two events are locked up in your weblog-reporting systems, or uninstrumented, or god forbid lost to time forever. Or they live in systems no one has access to, or that are generally ignored - for example, the physical plant of your ecommerce system, which may have various moments of unreliability that explain why "add to cart" trends are so spiky.
I think it's generally easier to deal with Type 3 coverage debt - underdetermined schemas - using the open-ended "exception report" method on a single source. In fact I think it's a good way to turn junior developers into intermediate ones, because it forces the developer to think about what they don't know about the source and the role the source plays within the business. But when it comes to larger, more diverse sources or Type 4 debt, the costs to fix it are going to be larger, simply because it takes more time and the resolution is not necessarily determinate.
Add up all of these interval costs and you get your total semantic debt: the sum, across every item in the backlog, of what it would cost to close the gap between What she gets and What she needs.