Wednesday, November 14, 2018

Thoughts on counting mistakes in data management

The second part of our thinking-out-loud/not-ready-for-prime-time implementation discussion is, "how do we know that an asset has failed?"  You might also say "how does doing any of this improve my day-to-day?" 

This seems like a hard question.  I think it's the big issue in the implementation program, which is only the latest subsection of what I've been yammering on about now for six months.  You might agree there's this thing called semantic debt, and you might think there are ways to assess it and so on.  But how do we know that something is actually in debt without manually inspecting every ETL package?  I've been somewhat apologetic in past posts for not getting to the answer to this question.  It seems like that should be the first thing we care about, right?

I don't think it's that important that we figure out how to count mistakes before we talk about how to classify them, actually.  The world is filled with malfunction right now, whether it's AI-based business models, the US Senate or the water cycle in California.  And our semantic assets are failing all the time; it's just the frequency that differs from one system to the next.  There are some fantastically stable ETL jobs out in the world, and some pipeline transformation code that has never generated an error.  There are probably ten-year-old APIs patiently waiting for requests right this minute, and microservices architectures that dutifully do what they're supposed to, and Excel workbooks some Finance department has used every month for twenty years.  And then there's stuff that just fails all the time, sometimes of its own accord, for completely mysterious reasons.



If you look at your own repositories you know that some percentage of the assets they hold will fail this year.  But you probably don't record any of these failures, and the very idea might seem somewhat distasteful.  This is odd, considering that we're in the business of manufacturing and distributing data.  We don't know, for example, how many times someone looked at a report and thought "oh jfc not this thing again."  At the very least, we can't report what the assets in our repositories are doing, like how many of the stored procs written by long-gone people still work.  We simply don't track data movement failures at any level of detail.   

So what percentage of your assets will fail this year?  Is it 10%?  20%?  Let's say you've got a gigantic repository of reports based on Crystal Reports sitting somewhere in your organization.  It's business-critical to some portion of your customer base or you wouldn't still have something like that, and there's a requirement written somewhere that you need to support your professional services people somehow.  Do you even know off the top of your head who owns and/or supports Crystal Reports as a product anymore?  Now tell me how bad it is: Name the interdependencies between those reports and data models in various systems, which scrum teams those models belong to, and what those models were intended to support.  If you can do anything like that analysis for even your own working ETL jobs or data science pipelines you're in a small minority.

Or you may have a bunch of web-scrapers written in node.js supporting a DevOps team supporting your software engineers.  You don't know, as a manager, how those things got there; they may have been written by an intern with a coke problem a couple of years ago, for all you know.  Or you may have an entire team that reads faxes all day and enters medical orders into a system, a data movement function you find all over the Healthcare industry.

Or suppose you've got a web analytics implementation and no one uses it.  That is, some people use it, and your BI team has some people who can drive the reporting tools, but for the most part the Marketing people you bought it for complain it doesn't do what they want it to.  You can do an in-house upgrade by cobbling together resources across a couple of teams that may work together all the time anyway, even though they're under separate budget lines; or you can hire specialists with dedicated expertise to do the upgrade.  But how many semantic assets are involved in either implementation, and how often do the current ones fail, and what types of issues do the assets have that cause them to fail?  I.e. do they fail because someone keeps changing the source-to-target mappings, or the natural keys, or what?  And what does the upgrade propose to do about those failure points?  What's the build vs. buy comparison when it comes to long-term failures?  Given that you've got some idea of what your web analytics implementation is supposed to tell you, or at least what you want to know in say the next three months, how much of that are you getting? 

There's a lot of money floating around in these systems.  Now it may be that you blow $50k here and $100k there on web analytics and you're just spending the money because someone asked you to.  But you could, with answers to those questions, start seeing some really cool stuff happen.  And maybe money is no object, but we all have an obligation to minimize the bullshit job phenomenon. 

How many of those things that make up your data management environment are going to fail between now and Q3 2019?  Can you tell me even how many things you've actually got?  I'm not asking about the quality, accuracy or even provenance of the data that shows up in your Tableau dashboards, because those are advanced questions.  I'd just like to know the number of transformation steps between manufacturer and consumer, and how many of those steps went "bad" last year, for some value of "bad."

You probably can't answer either question.  I understand there's a sense that building any theory around what we do in data management seems kind of pointless.  But the fact is that we don't currently do anything to measure our failures in data management or how those failures came to be, or to predict where they might happen in the future.  Again, this is very odd for people who are in the business of making and moving data.  It's pretty easy in most physical manufacturing systems to figure out which lines produce the most errors.  (Want an example that makes Industrial Engineers drool?  Read this blog from eFlex.)  It's also pretty easy to do analytics on logistics issues so you can forecast which categories of delivery problem are more likely at which times of year - hurricanes in the Fall, snow in the winter, fires in the West all year, for example.

So the reason I haven't thought it necessary to explain how to identify failures so far is because you already know, if you're reading this, that you have failures.  You get them all the time.  And you just fix them when they happen.  You may prioritize the fix by which customer needs it, and you're probably busy enough as it is.  But you don't actually do more than that.

And if you do actually do more than just prioritize fixes by customer, that extra effort probably stretches no further than recording who built the specific pipeline between the source and the target.

And it's not hard to record more than you do, obviously; it's just that we really haven't spent any time thinking about our repositories beyond "this is Priya's stuff" and "this is Doug's stuff" and "she isn't here anymore, but we haven't had time for anyone to take this code over."

But again, you've got no way to measure whether you're doing the right thing in data management.  You know by feel but you can't prove it to anyone.  There appear to be no quantifiable benefits to good data management practices, whether that's avoiding cartesian joins at the source or pie charts at the ultimate target.

One way to think about the problem is this: If you were asked to build a database to record all the mistakes someone was making in running their data warehouse or big data or analytics operation, and use that to figure out how to make more happen with less work, how would you do it?  We spend a good portion of our days as data warehouse or big data or analytics people storing and interpreting the data produced by instrumentation used to measure other people's work, but there's no instrumentation on our practice.
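If you want a concrete picture of what that instrumentation could look like, here's a minimal sketch in Python and sqlite3.  The table and column names are my own illustrative assumptions, not a prescribed schema; the point is just that a mistakes database is a small thing to stand up.

```python
# A minimal sketch of the "mistakes database" idea, using sqlite3 so it runs anywhere.
# Table and column names are illustrative assumptions, not a prescribed schema.
import sqlite3

conn = sqlite3.connect("semantic_debt.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS asset_failure (
        failure_id    INTEGER PRIMARY KEY,
        asset_name    TEXT NOT NULL,   -- e.g. 'attribution_report_etl'
        owner         TEXT,            -- who built or maintains it
        failure_type  TEXT,            -- e.g. 'source change', 'naming', 'requirements'
        detected_by   TEXT,            -- 'user complaint', 'usage dropoff', 'job failure'
        detected_at   TEXT,            -- ISO timestamp
        hours_to_fix  REAL             -- the raw material for costing the debt
    )
""")
conn.execute(
    "INSERT INTO asset_failure (asset_name, owner, failure_type, detected_by, detected_at, hours_to_fix) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("attribution_report_etl", "Doug", "source change", "user complaint", "2018-11-14T09:00:00", 6.0),
)
conn.commit()
```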

Now that we've broached the idea, though: I'm always in favor of proactivity.

So let's think about how we find out about our mistakes.

Anytime I'm presented with something that seems to be a hard question I like to proceed systematically.  Ecosystematically, to be precise, and while I won't map out the ecosystem I think we can get a sense of what it looks like logically.  This is a very boring approach, so if you've been wound up by the previous posts, prepare to yawn.  I will put these notes into a more rhetorically effective format shortly, but I think the logical approach ensures nothing obvious slips through the cracks.

Before we go further we should review how assets fail.  An asset fails when it no longer meets the need it was intended to meet.  When it fails to meet that need it goes into semantic debt.  A semantic asset is a subset of a source dataset that means something to a target.  Targets can be people or just locations, but there's an expectation at the target that the asset contains some quantity of data in some particular format.  We discover that an asset has failed in one of two ways, and there are four types of failure because we only do four tasks in data management.
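To make that definition a little more concrete, here's a rough sketch of a semantic asset as a data structure.  The field names are my own shorthand for the definition above, not canon.

```python
# A rough sketch of the definition above: a semantic asset is a subset of a source
# dataset that a target expects in a particular shape.  Field names are assumptions.
from dataclasses import dataclass

@dataclass
class SemanticAsset:
    name: str               # e.g. 'attribution_report'
    source: str             # the dataset the subset is drawn from
    target: str             # a person, team, or downstream location
    expected_columns: list  # the format the target expects
    expected_min_rows: int  # the quantity the target expects

    def has_failed(self, actual_columns, actual_rows) -> bool:
        """The asset is in semantic debt when it no longer meets the target's expectation."""
        return (not set(self.expected_columns) <= set(actual_columns)
                or actual_rows < self.expected_min_rows)
```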

So there are two ways to know that an asset has failed:
  1. Someone tells us the asset has failed.
  2. The asset or its associated assets tell us that it's failed.
The first happens all the time, but it's interesting to think about how.  Generally when people look at a report that doesn't do what they want, they'll take one of two actions.

First they might simply complain to the report author that the report doesn't work anymore - that it doesn't have the categories they need, the time scale they want, the comparisons they need to make, etc.  This often occurs when something happens in the business that renders the business logic in the report moot.  If we consider the Attribution report example we've used multiple times: It may be that as of yesterday Marketing is trying to track a new conversion channel, and doesn't see the results in their Attribution report this morning because the source system for the report simply doesn't collect the right data.  So the Attribution report becomes a problem, and hopefully someone lets the DW team know they've got to get the right data.

If you think about how that failure happens there's a whole series of identifiable operations that led up to the "failure" of the report, from Marketing's perspective, and a whole series of identifiable operations that can be done to fix it.  Both the operations left of failure and the ones right of failure have a cost.  TCO on the report is something like the long-term efficiency of moving from left to right on that particular Attribution report.  But the report is an asset, and there's a chain of assets preceding the report.  At some point in the chain something failed, and so all the assets more or less have failed. 

These are straightforward cases of failure: The BI layer isn't getting what it needs.  But we can get much more granular and thus effective with these cases.  This failure might be the kind of thing we can match against requirements (in the RQ function) and determine who's at fault, i.e. it turns out the report developer didn't do what was asked for.  Or it may be that there's a new requirement and we need to go through whatever process is expected of us.  

Either way these proactive communications - someone tells us face-to-face or via email - are a virtuous cycle that should be nurtured.

The second kind of action is not straightforward.  As DW people we've all been in situations where users find a problem and don't tell us: The report we spent months building for the users is inadequate, but in their frustration with our process or our incompetence they just decide to abandon it.  In situations where a user group stops using a report we might have near-immediate indications.  For example, if our Attribution report used to be the first thing everyone opened in the morning and now it isn't, and we can see that by looking at our reporting-tool server logs, then we have a really obvious way of monitoring failure.  We check for dropoffs in usage statistics, in the same way that television executives look to see where in the TV show people stopped watching or ebook providers know where in the ebook people stopped reading.
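As a sketch of what that dropoff check might look like, assuming you can pull daily open counts for a report out of your reporting-tool server logs (however your tool exposes them), something like this would do.  The window sizes and threshold are arbitrary assumptions to tune against your own traffic.

```python
# A hedged sketch of the usage-dropoff check: flag a report whose recent usage
# falls well below its trailing baseline.  Thresholds are arbitrary assumptions.
from statistics import mean

def usage_dropoff(daily_opens, recent_days=7, baseline_days=28, threshold=0.5):
    """Return True if the recent average is less than `threshold` of the baseline average."""
    if len(daily_opens) < recent_days + baseline_days:
        return False  # not enough history to judge
    recent = mean(daily_opens[-recent_days:])
    baseline = mean(daily_opens[-(recent_days + baseline_days):-recent_days])
    return baseline > 0 and recent < threshold * baseline

# e.g. opens per day for the Attribution report, oldest first
opens = [40, 42, 39, 41, 38, 40, 43] * 4 + [5, 3, 4, 2, 6, 4, 3]
print(usage_dropoff(opens))  # True: something changed
```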

In situations where the report is heavily used but the period of its use is infrequent, this is a much harder problem to monitor.  If our Attribution report is only checked once a quarter, perhaps because it's used to pay out to affiliates, then it's obviously going to take much longer to collect statistics that would allow us to compare use.

Note that in assets that are the result of long chains of data movement, the failure might not be in the asset where you spot the problem.  I.e. sometimes the asset itself isn't the problem.  The semantic subset of the source might perfectly reflect the requirements and the source, but it's the source (or even the requirements) that's the problem.  We may see the Attribution report and find nothing wrong with the report itself - i.e. with the Tableau workbook that displays the data - or the DW the data comes from.  It's at the source that populates the DW that we discover there's a problem, as in the quite prevalent case where Marketing adds an attribution channel and for some reason that data doesn't flow through.



You might also say in this case, "there's more than one use for the data, it's just the Attribution report that's gone wrong.  You can't say the entire chain has failed just because Marketing forgot to tell us they added a new channel."  An excellent point, except that the missing channel indicates there's an architectural problem in the way you're handling your data.  I mean, you can pretend it's not bad, or could be worse, or that it's not your fault.  But you're handling the data, right?  So it's an architectural failure.  If you don't fix the underlying architectural problem the failure will recur.  So quantify the cost of the current failure and multiply that by the number of times you expect to make the recurring fix, i.e. your semantic debt, and compare that to the architectural cost of a real fix.  "But I don't know how many times I'll need to make the fix, Dave," you reply.  Then you need to go to your user group and ask them what their plans are.
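The comparison in that paragraph is just arithmetic.  Here's a back-of-the-envelope version; every number in it is invented for illustration.

```python
# A back-of-the-envelope version of the recurring-fix vs. architectural-fix comparison.
# All numbers are made up; the point is the shape of the calculation, not the values.
recurring_fix_hours = 6      # hours to patch the missing channel each time
hourly_rate = 120            # fully-loaded cost per hour
expected_recurrences = 8     # ask Marketing how often they plan to add channels

carrying_cost = recurring_fix_hours * hourly_rate * expected_recurrences  # the semantic debt
architectural_fix_cost = 40 * hourly_rate                                 # a one-time real fix

print(f"carrying the debt: ${carrying_cost:,}")        # $5,760
print(f"architectural fix:  ${architectural_fix_cost:,}")  # $4,800
# If the carrying cost exceeds the fix, the architecture is the cheaper problem to solve.
```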

But in this case the problematic asset the audience sees only inherits the failure of the source assets.  You've got a good ETL architecture and you can demonstrate this to your managers.  One thing your architecture helps with: You can easily find and fix the cause of the problem.  This is a positive attribute of doing things the right way, and you should tell your boss that's why you did it that way!  Instead of spending money combing through spaghetti code packed into lasagna layers, you've got an architecture that allows you to isolate the cause of the semantic debt.  You can probably even quantify the counterfactual!     

It may not change the mistaken impression of who's at fault, of course, which is why proactive nurturing of virtuous cycles is so important.  (You need to make sure the people who find out about problems in your user base, which is your BI or reporting or QA people, have a confident and neutral relationship with your users.) 

So we've got two submechanisms for determining that an asset has failed its intended audience.  First, we can depend on our users to tell us something has gone wrong.  Second, where we can't depend on our users to tell us we can monitor their use of the stuff we built, like Big Brother, to see if there's any significant variation in that use which might indicate a problem.

2.  The other way we know that an asset has failed is because it or its associated assets tell us that it failed.  This is a lot more obvious than it sounds.

One easy way to know that source data has changed or that a target has changed is an automated-process failure.  If the ETL job stops running or goes from running lickety-split to taking hours or days, it's quite likely there's a failure at either end.  Long-running jobs might also be the result of broken indexes or something similar, but for the most part, absent any significant changes in search infrastructure at the source or the target, an ETL job fails because the source has changed.

You can graph the more gradual failures, if you think about it.  Sudden failures like broken jobs are easy to spot.  Gradual failures can be spotted by looking for plots of execution times that seem to grow faster than the volume of data transferred.  And if you know that an asset has failed, and you know which assets are downstream of that asset, then you know the downstream assets have failed as well. 
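Here's a hedged sketch of both ideas, assuming you already log execution time and rows moved for each run, and that you have some map of which assets feed which.  The function names and the dependency map are illustrative assumptions, not a finished monitor.

```python
# Detect gradual failures (runtime outpacing data volume) and propagate a known
# failure to everything downstream of it.  Names and data here are assumptions.
def runtime_outpacing_volume(runs, window=5, tolerance=1.5):
    """Flag a job whose seconds-per-row is drifting upward by more than `tolerance`.

    `runs` is a list of (seconds, rows_moved) tuples, oldest first.
    """
    if len(runs) < 2 * window:
        return False
    per_row = [s / max(r, 1) for s, r in runs]
    early = sum(per_row[:window]) / window
    late = sum(per_row[-window:]) / window
    return late > tolerance * early

# If an upstream asset fails, everything downstream of it has failed too.
downstream = {
    "crm_extract": ["dw_attribution_fact"],
    "dw_attribution_fact": ["attribution_report"],
    "attribution_report": [],
}

def failed_closure(asset, graph):
    """Return the asset plus every asset downstream of it."""
    failed, stack = set(), [asset]
    while stack:
        a = stack.pop()
        if a not in failed:
            failed.add(a)
            stack.extend(graph.get(a, []))
    return failed

print(failed_closure("crm_extract", downstream))
```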

So how do you find all of these assets?  You've got repositories of this stuff.  You've got relational db system tables with stored procs or ETL package configurations, Github trees with code organized into more-or-less straightforward processes, file servers with the start- and endpoints of transformations.  You've got scripts and Tidal jobs and Jenkins jobs. 

Now maybe your people haven't been so good at normalizing their naming conventions, and so all those pipelines in Git written by your younger Data Scientists have what might be called an "evolving" set of labels for sources and targets.  This isn't a surprise.  I did this and still do this myself.  It's called Agile, man.

But that is semantic debt, and we can actually quantify its cost.  When we track the number of problems caused by that code in particular - by the pipelines Doug wrote - and quantify the number of hours it took to fix them, as well as the two hours of pain & suffering it took Deepam to explain the fix to Marketing or whoever downstream expected to use that data, we can quantify the semantic debt in Doug's repository.  We can put a dollar value on the cost to the org of doing things the wrong way.
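Putting a dollar value on it is the easy part once the failures are recorded.  A minimal sketch, with invented records and an invented hourly rate:

```python
# Tally hours-to-fix plus explanation overhead by owner to price a repository's debt.
# The records and the rate are invented for illustration.
from collections import defaultdict

HOURLY_RATE = 120
failures = [
    {"owner": "Doug", "hours_to_fix": 6, "hours_explaining": 2},
    {"owner": "Doug", "hours_to_fix": 3, "hours_explaining": 1},
    {"owner": "Priya", "hours_to_fix": 1, "hours_explaining": 0},
]

debt_by_owner = defaultdict(float)
for f in failures:
    debt_by_owner[f["owner"]] += (f["hours_to_fix"] + f["hours_explaining"]) * HOURLY_RATE

print(dict(debt_by_owner))  # {'Doug': 1440.0, 'Priya': 120.0}
```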

Now if we start quantifying the types of the failures - Doug isn't good at keeping his naming conventions consistent, so anyone using his stuff has to watch for Type 1 debt - then we can see where our problems might arise and how we can go about fixing them. 

You can't tell me that extracting your ETL and/or pipeline and/or reporting code out of system tables or online repositories, identifying sources and targets, and creating assets out of the transformation blobs in between S and T is the kind of thing that'd take more than a couple of days to do.  You don't need to try this for Adobe Analytics or the solid-state transformations that happen inside e.g. Axapta.  Just do it for the stuff you've personally built.  Extract ETL/pipeline logs into the Semantic Fact model so every time a process executes you've got its context.  Apply some intelligence - some "machine learning" as well as just some BI - to those facts to see what's happening.
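One way to read that suggestion, sketched below: emit a fact row for every process execution so each run carries its own context.  The column set here is an assumption about what a Semantic Fact row might hold, not a spec.

```python
# Append one execution fact per run: which asset ran, between which source and
# target, and how it went.  File name and columns are illustrative assumptions.
import csv
import datetime

def record_execution(path, asset_name, source, target, status, seconds, rows_moved):
    """Append one execution fact for a single pipeline or ETL run."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.utcnow().isoformat(),
            asset_name, source, target, status, seconds, rows_moved,
        ])

# Called from whatever wraps your ETL job or pipeline step:
record_execution("semantic_facts.csv", "attribution_etl",
                 source="crm.orders", target="dw.attribution_fact",
                 status="success", seconds=212, rows_moved=48210)
```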

So it's really very easy to count mistakes.  First, though, we need to record them, and right now we don't do any of that at all.



