An analytical matrix is a model of the world. This idea shouldn't be new to anyone who reads stuff like this, but let me explain a little further.
(I'm going to switch the names around here so anyone who's followed this closely might get a little confused for a moment. I don't think anyone is actually following any of this that closely anyway, but the names of the axes have been bugging me for a while and I don't think they fit. So I'm going to change them to something more appropriate.)
What I've proposed is that you can classify all of your data management problems using a 2×3 matrix. The Y axis of the matrix is the Debt Category, which breaks down as follows:
My claim is: If you look around your data management shop you'll find you've generally got people working on two types of problems.
- They're trying to solve Equivalence problems: They're writing reports or building pipelines or ETL processes that combine different schemas (Type 1) or resolve different natural keys (Type 2).
- They're trying to solve Coverage problems: They're trying to figure out the schema for a set of files or a storage system (or trying to enforce governance rules on the same), or trying to figure out how to store something in an existing schema.
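To make the two Equivalence sub-types concrete, here's a minimal sketch in plain Python. All of the field names, keys, and values (`cust_id`, `CUSTOMER_NO`, and so on) are hypothetical, invented just for illustration:

```python
# Type 1 debt: two sources describe the same entity under different schemas.
# (All names below are hypothetical.)
crm_row = {"cust_id": 17, "full_name": "Ada Lovelace"}
billing_row = {"CUSTOMER_NO": 17, "NAME": "Ada Lovelace"}

# Resolving Type 1 debt means mapping one schema onto the other:
schema_map = {"CUSTOMER_NO": "cust_id", "NAME": "full_name"}
normalized = {schema_map[k]: v for k, v in billing_row.items()}

# Type 2 debt: the schemas line up, but the natural keys don't - the same
# customer is keyed differently in each system, so you need a crosswalk:
key_map = {"B-0017": 17}  # billing key -> CRM key (hypothetical)
```

Real pipelines do this with joins and mapping tables rather than dict comprehensions, but the shape of the work is the same: a schema map for Type 1, a key crosswalk for Type 2.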
But - and this is very important! - every problem a team faces is one of these four types. They're either working out how to glue two sets of data together, or they're working out what the data means. Sometimes you need to do the one before the other, but this is all there is.
By the way, there's no real division of labor here. It's not as if BI people are hired to work on Equivalence debts and ETL people work on Coverage debts, because they aren't. In the course of building a report a BI analyst will have to figure out what a schema means, often by analyzing apparent natural keys, and then figure out how to merge that schema with something they've already got. The ETL people do exactly the same thing. The primary difference between BI and ETL is that one person's output is a visualization (which might just be a table) and the other's is a table or file. If you're a data scientist then you're doing both jobs - but hey, they're calling you a scientist, so that's worth something, right?
So there's only ever four kinds of problem that data management people wrestle with.
The second claim is that there's only three ways the data management team discovers they've got these problems. They can be surprised by them (duh!), in which case they've got a Sudden failure. They can know about them because they're a consequence of the design or scope of the systems they work with, in which case they're Immediate failures. Or these failures can build up gradually, where no one specific failure is significant enough to be a Sudden failure but over time enough debt builds up to cause a problem. These are Gradual failures. These Failure Modes are our X axis.
So this is it - this is what I'm claiming. You can design your fancy stateless ETL solution around this matrix, hire people who specialize in one kind of debt or another, or go out and make a lot of consulting cash analyzing an organization's data management plans using this matrix.
But it's not enough for Erin and people like her, which is to say everyone who doesn't live a life of enforced leisure such as myself. So the next step is to cross-reference the matrix with the problems it solves, as follows:
Now that we've got the dimensions for the matrix and we've filled in the cells with corresponding problems, we can flip the matrix around and figure out what to do with our problems. To use the Index, you look up your problem, determine which Debt Category and Failure Mode it's an example of, and then apply the mitigation.
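The lookup procedure can be sketched as a six-cell table keyed by the two axes. The axis names below are the real ones from the matrix; the mitigation strings are placeholders, not the actual contents of the Problem Index:

```python
# The two axes of the matrix, as described above.
DEBT_CATEGORIES = ("Equivalence", "Coverage")
FAILURE_MODES = ("Sudden", "Immediate", "Gradual")

# Build the 2x3 Problem Index. The mitigation text is a placeholder;
# the real Index fills each cell with a specific mitigation.
problem_index = {
    (debt, mode): f"<mitigation for {debt}/{mode}>"
    for debt in DEBT_CATEGORIES
    for mode in FAILURE_MODES
}

def mitigation(debt_category: str, failure_mode: str) -> str:
    """Classify a problem by its cell and return that cell's mitigation."""
    return problem_index[(debt_category, failure_mode)]
```

The point of the structure is that classification is the whole job: once you've named the Debt Category and the Failure Mode, the mitigation is a lookup, not a judgment call.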
Solving Coverage problems takes a specific kind of skill, one that's very different from playing around with fields to figure out an appropriate natural key. If you've got a lot of Type 2 debt you're better off looking for someone with a lot of business knowledge to help glue schemas or keys together than you are asking a newly-hatched BI analyst to solve the problem. If most of your problems fall into the Equivalence category you should build or expand your Data Lake (or build something more sophisticated - ask me how!) so you can start to work on the semantic components you're going to turn into joins. On the other hand, if you're anywhere in the vicinity of the Blackberry situation you need to rethink your conceptual model.
You can find a more readable version of the Problem Index here.