Tuesday, September 18, 2018

How to use the Semantic Debt Problem Matrix to solve problems

So how do you use the giant matrix I put up yesterday?

An analytical matrix is a model of the world.  This idea shouldn't be new to anyone who reads stuff like this, but let me explain a little further.

(I'm going to switch the names around here so anyone who's followed this closely might get a little confused for a moment.  I don't think anyone is actually following any of this that closely anyway, but the names of the axes have been bugging me for a while and I don't think they fit.  So I'm going to change them to something more appropriate.)

What I've proposed is that you can classify all of your data management problems using a 2x3 matrix.  The Y axis of the matrix is the Debt Category, which breaks down as follows:


 
My claim is: If you look around your data management shop you'll find you've generally got people working on two types of problems. 

  1. They're trying to solve Equivalence problems: They're writing reports or building pipelines or ETL processes that combine different schemas (Type 1) or resolve different natural keys (Type 2).  
  2. They're trying to solve Coverage problems: They're trying to figure out the schema for a set of files or a storage system (or trying to enforce governance rules on the same), or trying to figure out how to store something in an existing schema.

In a more traditional data management team this work is done by BI analysts and ETL developers.  In a data science group individual data scientists will wrestle with all four problems.  In a large enterprise group there may be a team of business analysts working on one problem at a time, and then handing off what they've found to specialists.

But - and this is very important! - each team will only ever have one of these four types of problem.  They're either working on how to glue two sets of data together, or they're trying to figure out what the data means.  Sometimes you need to do the one before the other, but this is all there is.

By the way, there's no real division of labor here.  It's not as if BI people are hired to work on Equivalence debts and ETL people work on Coverage debts, because they don't.  In the course of building a report a BI analyst will have to figure out what a schema means, often by analyzing apparent natural keys, and then figure out how to merge the schema with something they've already got.  The ETL people do exactly the same thing.  The primary difference between BI and ETL is that one person's output is a visualization (which might just be a table) and the other's is a table or file.  If you're a data scientist then you're doing both jobs - but hey, they're calling you a scientist so that's worth something, right?

So there are only ever four kinds of problem that data management people wrestle with.

The second claim is that there are only three ways the data management team discovers they've got these problems.  They can be surprised by them (duh!), in which case they've got a Sudden failure.  They can know about them because they're a consequence of the design or scope of the systems they work with, in which case they're Immediate failures.  Or these failures can build up gradually, where no one specific failure is significant enough to be a Sudden failure but over time enough debt builds up to cause a problem.  These are Gradual failures.  These Failure Modes are our X axis.
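
As a minimal sketch, the two axes can be written down as a pair of enumerations.  The Python names here (DebtCategory, FailureMode, cells) are mine, made up for illustration; only the category and failure mode labels come from the text above.

```python
from enum import Enum
from itertools import product


class DebtCategory(Enum):
    """Y axis: what kind of problem the team is working on."""
    EQUIVALENCE = "Equivalence"  # gluing two sets of data together
    COVERAGE = "Coverage"        # figuring out what the data means


class FailureMode(Enum):
    """X axis: how the team discovers it has the problem."""
    SUDDEN = "Sudden"        # the team is surprised by it
    IMMEDIATE = "Immediate"  # known consequence of design or scope
    GRADUAL = "Gradual"      # debt builds up over time


# Every cell of the matrix is one (category, failure mode) pair.
cells = list(product(DebtCategory, FailureMode))

for category, mode in cells:
    print(f"{category.value} / {mode.value}")
```

Two categories by three failure modes gives six cells, which is all the matrix is.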


So this is it, is what I'm claiming.  You can design your fancy stateless ETL solution around this matrix, hire people who specialize in one kind of debt or another, or go out and make a lot of consulting cash analyzing an organization's data management plans using this matrix.

But it's not enough for Erin and people like her, which is to say everyone who doesn't live a life of enforced leisure such as myself.  So the next step is to cross-reference the matrix with the problems it solves, as follows:


Now that we've got the dimensions for the cube and we've filled in the cells with corresponding problems, we can flip the matrix around and figure out what to do with our problems.  To use the Index, you look up your problem, determine which Debt Category and Failure Mode it's an example of, and then apply the mitigation.
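
The lookup itself is mechanical, and a rough sketch in Python might look like this.  Everything in it is an assumption for illustration: the problem descriptions, the cells they're assigned to and the mitigation strings are placeholders, not entries from the actual Problem Index.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    debt_category: str  # "Equivalence" or "Coverage"
    failure_mode: str   # "Sudden", "Immediate" or "Gradual"


# Hypothetical index entries: problem -> the cell it is an example of.
PROBLEM_INDEX = {
    "report has to combine two schemas": Cell("Equivalence", "Immediate"),
    "nobody knows the schema of the new files": Cell("Coverage", "Sudden"),
}

# Hypothetical mitigations, keyed by cell.  Placeholder text only.
MITIGATIONS = {
    Cell("Equivalence", "Immediate"): "placeholder mitigation for Equivalence/Immediate",
}


def look_up(problem: str) -> tuple[Cell, str]:
    """Find the cell for a problem, then the mitigation for that cell."""
    cell = PROBLEM_INDEX[problem]
    mitigation = MITIGATIONS.get(cell, "no mitigation recorded for this cell")
    return cell, mitigation


cell, mitigation = look_up("report has to combine two schemas")
print(cell.debt_category, cell.failure_mode, "->", mitigation)
```

The point of keying the mitigations by cell rather than by problem is that two very different-looking problems in the same cell get the same treatment.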

Solving Coverage problems takes a specific kind of skill, one that's very different from playing around with fields to figure out an appropriate natural key.  If you've got a lot of Type 2 debt you're better off looking for someone with a lot of business knowledge to help glue schemas or keys together than you are asking a newly-hatched BI analyst to solve the problem.  If most of your problems fall into the Equivalence category you should build or expand your Data Lake (or build something more sophisticated - ask me how!) so you can start to work on the semantic components you're going to turn into joins.  On the other hand, if you're anywhere in the vicinity of the Blackberry situation you need to rethink your conceptual model.

You can find a more readable version of the Problem Index here.

Monday, September 17, 2018

How to talk to your manager about Semantic Debt

There's a long post still coming, but for the last three weeks I've been busy with (a) a fruitless bowhunting trip, (b) a book proposal and (c) a stunning flu that left me and my family lolling around the house like indolent aristocrats.  Even the dog is embarrassed.  So I haven't had a chance to get the matrix I promised filled in.

But there's some detail we can add to our matrix that's useful, topical and untheoretical:

I often try out my architectural thoughts on a friend who functions as a one-person data warehouse team wherever she goes.  She's capable of doing anything from the moment someone says they want one to the moment she hands you the keys, but she'd really rather not do even most of it anymore, because it's done better with other people's help.  She doesn't feel a lot of need for architecture in Data Warehousing because she's already pretty good at it, and happy to ignore it as well if it's dumb.  Since most projects end up going down a couple of different paths, ones that anyone who's done this more than once can start to recognize, there's only a couple of ways to start and proceed that avoid those parts of the process.

She generally replies to my thoughts, if they're bad ones, with polite silence.  If they're good ones but possibly not doable in the three-month cycle her strategic timeline has been reduced to, she replies with some combination of encouragement and practical concern.  She'll say something along the lines of "very interesting Dave but why should I care?"  (It doesn't sound as mean when she says it.)  And her practical concern with semantic debt is: OK let's say this is the right way to describe the problem.  How do you convince someone to do something about it?  If there's a project going overboard right now that you could save if you did the right thing (and if what I'm saying is the right thing to do), then what would you say to people to get them to do the right thing?

This gets to the heart of any big-picture problem in data management: No one believes they've got a problem (except for the data management people) until suddenly they've got it.  Every data management challenge is going fine until it ends up in the emergency room.  Almost everything in data management appears to be a sudden failure that lasts for twelve months.

I've been thinking about my friend's concern for a while, as I keep expanding the framework.  How do you convince an organization that there are consequences to bad data management?

One way to do it is to predict what will happen if they get it wrong.  That may not be the best approach, but it's a start.  So here we go.  I'm going to go through the matrix and give you consequences for each cell.  That is, if you've got semantic assets in Sudden failure with Type 3 debt, what happens to you?  So here's a first cut, with some very specific examples.



I've put a Google Sheets version here.

Thursday, September 6, 2018

The Rule of 3s

I wanted to do something shorter and more process-oriented while I work through the meat of the Assessment matrix.  Data architecture is almost the funnest thing you can do for money, honestly.  At least in the top 5 things you can do for money, anyway.  But it's something of a black art, especially now that many organizations have decided semantics are an afterthought.

I also wanted to try LinkedIn's article publishing system.  Avinash Kaushik uses it quite a lot, and the pieces look pretty slick.  The result is here: The Rule of 3s.

