Wednesday, October 31, 2018

Calculating the cost of mistakes (in data management)

A little thinking out loud, which is perhaps what this blog has become.  But bear with me here.

With the S->a->T model of semantic assets I pictured in my LinkedIn article, we can start to see the outlines of a data model that would allow us to track semantic assets.  Once we can track them we can start to figure out how much stupid design decisions cost us.


S-a-T is the bare portion of the fact.  You need dimensions for source and target.  The SOURCE dimension is a location, plus an owner and a file format/device/delivery mechanism and so on.  The TARGET is the same kind of dimension.

Can a SOURCE be a report and a TARGET a human?  Why not?  We're just as concerned about reports becoming inaccurate for actual real live people as we are about extracts that are no longer meaningful for their stage tables.  And since the asset portion of the model is a syntactic representation of the semantic relationship between S and T, there's no reason the asset couldn't be, say, a Tableau report or an SSRS RDL file or a package of R code, like the kind of thing we see in the new integrated IDEs like Mode.

Did I just say that the asset was a syntactic representation of the semantic relationship between S and T?  Yes I think that's true.  Semantic debt is the result of the failure of the syntactic representation/asset to satisfy the intended relationship between source and target.

There's a thing the target needs to get out of the source(s).  The target asked for it, spec'd it, intends to use it.  That thing is supposed to help the target do something: Buy a car or a box of gluten-free low-carb energy bars; call customers most impacted by a backlog in fulfillment or explain to investors where their cash is going; send a signal to the braking module because the pattern on the sensors means the vehicle is about to hit a deer.

That thing is the asset.  A philosopher would say the asset subs-in for, or serves the function of, the semantic relationship.  Function in the mathematical sense, the way Frege talks about it: The asset is the sense of the relationship the target sets up with the source.  The semantic relationship is the signal conveyed between the two.  The signal is carried by the asset, a bunch of symbols that convey it when interpreted correctly.

Okay enough philosophy.  So the S, the T and the a are all dimensions.  The a can be one of four types.  In fact lots of assets will be more than one type, which is a practical problem.  That's ok, though; we just make ASSET_TYPE and ASSET_DIM into a many-to-many relationship.  If you've architected your data management system in such a way that you can reduce each data movement operation into one of the four types then you get better resolution in your reporting.  But if you haven't done that, and you've got (reasonably) complex assets that do more than one kind of thing, you can still fit them into the S-a-T model and do the reporting.
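A minimal sketch of that many-to-many, using SQLite via Python.  Every table and column name here is invented for illustration; the only point is the bridge table between ASSET_DIM and ASSET_TYPE, so a complex asset can carry more than one type:

```python
import sqlite3

# Sketch only: invented names, in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE asset_type (type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE asset_dim  (asset_id INTEGER PRIMARY KEY, asset_name TEXT);
CREATE TABLE asset_type_bridge (
    asset_id INTEGER REFERENCES asset_dim(asset_id),
    type_id  INTEGER REFERENCES asset_type(type_id),
    PRIMARY KEY (asset_id, type_id)
);
INSERT INTO asset_type VALUES (1,'Type 1'),(2,'Type 2'),(3,'Type 3'),(4,'Type 4');
INSERT INTO asset_dim  VALUES (10,'orders_extract');
-- a complex asset does more than one kind of thing
INSERT INTO asset_type_bridge VALUES (10,1),(10,2);
""")
rows = con.execute("""
    SELECT a.asset_name, t.type_name
    FROM asset_dim a
    JOIN asset_type_bridge b ON b.asset_id = a.asset_id
    JOIN asset_type t ON t.type_id = b.type_id
    ORDER BY t.type_id
""").fetchall()
print(rows)  # [('orders_extract', 'Type 1'), ('orders_extract', 'Type 2')]
```

If you can reduce every operation to a single type, the bridge collapses to one row per asset and your reporting resolution improves; the model works either way.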

Two points.  First, there's this whole question about "reducibility" and "complexity."  Some people have three-thousand-line stored procs that move data from one place to another and do lots of crazy voodoo to that data in between.  What type of asset is that?  Isn't that kind of thing just too complex to fit into a model as simple as S-a-T?

No, not really, because it's still a semantic asset: That script is just a bunch of smaller semantic assets, each of which does one of the four basic operations, combined into one large operation.  (And it's not even likely all four operations are involved in your 3000-line script: That particular scenario is really only looking at two operations, because if you're pulling data out of a defined set of schemas you're doing Equivalence operations exclusively.)

You can easily decompose large ETL scripts or pipelines into component operations by doing some text parsing, taking each SQL component, identifying whether it does Type 1 or Type 2 stuff, and making those components into atomic facts.  (I.e. identifying the S in your FROM clauses and the T in your INSERT INTO clauses, where the a is just the query itself.)  You'd have a much easier time debugging and managing your pipelines if you did that, obviously.  UPDATEs and MERGEs would obviously be part of some other components, so not every SQL block is standalone; sometimes there's more than a couple of irreducible blocks per operation.
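A rough sketch of that parsing step in Python.  Real SQL deserves a real parser (sqlglot, for instance), and the table names below are invented; the point is just pulling the S out of FROM/JOIN clauses and the T out of the INSERT INTO clause, with the query itself as the a:

```python
import re

# Invented example block from a hypothetical pipeline script.
sql = """
INSERT INTO stage.orders_clean
SELECT o.order_id, c.region
FROM raw.orders o
JOIN raw.customers c ON c.customer_id = o.customer_id;
"""

# Regexes are illustrative only; they'd break on subqueries, CTEs, etc.
target  = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)

fact = {
    "S": sources,       # sources, one per FROM/JOIN clause
    "T": target,        # target from the INSERT INTO clause
    "a": sql.strip(),   # the asset is just the query itself
}
print(fact["S"], fact["T"])  # ['raw.orders', 'raw.customers'] stage.orders_clean
```

Each block parsed this way becomes one atomic fact; UPDATE and MERGE blocks would need their own patterns and would often group with neighboring blocks into a single operation.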

But while there's a lot of value in being able to report on your assets at the atomic level you can also get a lot of value out of reporting at a higher level.  Any organization with one 3000-line pipeline script is likely to have more than one, because 3000-line scripts are a style and not a necessity.  That is, if you've got someone who likes to write 3000-line pipeline scripts then they've probably either infected your whole team or they've written a bunch of other stuff in a similar vein.

Second, we want to report on these things.  There is always a hierarchy of semantic assets, from the individual UPSERT all the way up to the thing your manager calls a pipeline.  The end goal is to identify and manage semantic assets in such a way that we can (a) understand what's failed, (b) understand what's likely to fail, and (c) assign cost functions to stupid mistakes.

If you understand your ETL architecture (or pipeline architecture, if you're not 100 years old like me) as a geographical map showing the flow of data from one part of the organization to another, then knowing there's a failure in one flow is a start.  You don't need to map facts down to the immutable block of SQL or R.  If you've got a large geography to cover, you'd be better off looking at the big map and drilling into specific problem areas before you worry too much about individual components.  Think of your organization as analogous to this map:


In your organization you're moving data from the transaction systems, which are managed by a scrum team called "Alberta," to a bunch of other departments, one of which is called "Houston."  An individual pipeline from Alberta to Houston has a lot of components and a lot of handoffs.  There are still only four operations, or four ways the pipeline can go wrong, if it's moving data.  To get to whether we've got more problems with Type 1 than Type 3 we need to get down to a much more atomic level, but we can still track pipelines and their components at higher-order levels.

So while it'd be great to monitor the links between each pumping station, we all know actual ETL design is much chunkier.  If you graphed the failure of complexes of semantic assets you'd see something like this:


(You'd see it with oil & gas pipelines too, although oil & gas lobbyists will tell you pipelines never fail and besides you drive a car so nyah nyah.)

That is, some teams are better at designing semantic assets and stringing them together than others.  If Team A's job is to build APIs and Team B is supposed to use those APIs to get data into the data lake, then giving Team A some API management software may help fix their failure rate, but it doesn't help you see the whole problem.  As near as I can tell, API management software is mostly a metadata management package with some deployment tools added in.  While that's helpful, it's only one piece.

So it's important to understand the context an individual semantic asset operates in too.  We need to know the process the asset takes part in.  "Process" is a proper name here, and it's likely to change over time.  The processes used in order management now may not be the same as those used in three years; the pipelines we use to calculate error rates on our experimental device data will not be the same ones we use when those calculations are used in production devices.  That's ok, because more than likely - much more often than not? - we'll reuse the atomic semantic assets in the new processes.  We may get a new order management manager, for example, who wants us to deal with order backlogs in a different way.  But we still need to run the same query we used to.  I mean, that piece doesn't change, sometimes ever.

So we need a Process dimension as well.  Here's a sketch of the star schema:


(Unlike most of the other graphics I've posted on this blog, this one is designed to be embiggened.  If you click on it the colors won't go all weird.  I also understand there's potentially a lot of value, monetary or otherwise, in what's posted here.  I think this is a data model pattern, like the PARTY model, so there's not much point in keeping it hidden.  Also, no one is paying me for anything right now, but you got to spend money to make money.)
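A minimal relational sketch of that star, in SQLite via Python.  The table and column names are my own invention; the structure is the point: SOURCE and TARGET share a shape, the asset is its own dimension, and PROCESS names the context the asset runs in.

```python
import sqlite3

# Sketch only: invented names, in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source_dim  (source_id INTEGER PRIMARY KEY,
                          location TEXT, owner TEXT, format TEXT);
CREATE TABLE target_dim  (target_id INTEGER PRIMARY KEY,
                          location TEXT, owner TEXT, format TEXT);
CREATE TABLE asset_dim   (asset_id  INTEGER PRIMARY KEY, asset_name TEXT);
CREATE TABLE process_dim (process_id INTEGER PRIMARY KEY, process_name TEXT);

-- the S-a-T fact: one row per data movement operation
CREATE TABLE semantic_asset_fact (
    source_id  INTEGER REFERENCES source_dim(source_id),
    target_id  INTEGER REFERENCES target_dim(target_id),
    asset_id   INTEGER REFERENCES asset_dim(asset_id),
    process_id INTEGER REFERENCES process_dim(process_id)
);
""")
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)
```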

If we load this up with our assets, at whatever level in the hierarchy you can get, we'll start to see some patterns.  Don't worry too much about filling the process in right away, because you can discover it.

For example, for almost every component that isn't an ultimate end-point target or a straight-from-the-customer-click source, there's a chain of assets.  Do a loop on the fact table and for every source that's some other fact's target, join the two facts, like this:
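That loop is just a self-join on the fact table.  A sketch in SQLite via Python, with invented asset and table names:

```python
import sqlite3

# Sketch: chain facts wherever one fact's target is another fact's source.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact (source TEXT, target TEXT, asset TEXT);
INSERT INTO fact VALUES
  ('raw.orders',   'stage.orders', 'load_orders'),
  ('stage.orders', 'dw.orders',    'conform_orders'),
  ('dw.orders',    'rpt.backlog',  'backlog_report');
""")
chain = con.execute("""
    SELECT f1.asset AS upstream, f2.asset AS downstream
    FROM fact f1
    JOIN fact f2 ON f2.source = f1.target
""").fetchall()
print(chain)  # each pair is one link in the discovered process
```

Repeat the join (or use a recursive CTE) and the pairs string together into the full chain from raw source to report.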


That's an organic "process."  Some components of it may be managed by e.g. "Order Management" or "the Madison QA office," in which case there's a hierarchy of responsibility for the process and subprocesses.  Which is good, because now we can start to figure out who owns the mistakes we see - who owns the pipeline failures.  And given that virtually everyone puts their stuff into repositories of various sorts now, it shouldn't be hard to get the facts together and then tie them into a long chain.

The astute reader may have noticed that I've only got one S on my fact, but many assets have more than one S attached to them.  Ah yes, very astute.  You can handle it in one of two ways: Each S gets its own row, or multiple Ss associated with a single asset are clumped together and treated as immutable.  I kind of prefer the latter, and then later on in the day I kind of prefer the former.  I need to think through which is more valuable, from a reporting standpoint.  But this is a "cancellation of terms"-type problem in the SQL you'd use to put together the organic process - you'd use a non-aggregating GROUP BY to get rid of the duplicate assets - so either, I think, will work for now.
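The one-row-per-source option plus that dedup step can be sketched quickly.  The names are invented; the GROUP BY does no aggregation, it just collapses the duplicate asset rows when you only care about the asset/target pair:

```python
import sqlite3

# Sketch: a two-source asset stored as two fact rows, then collapsed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact (source TEXT, target TEXT, asset TEXT);
INSERT INTO fact VALUES
  ('old_esp.bounces', 'rpt.bounces', 'bounce_merge'),
  ('new_esp.bounces', 'rpt.bounces', 'bounce_merge');
""")
collapsed = con.execute("""
    SELECT asset, target
    FROM fact
    GROUP BY asset, target   -- non-aggregating GROUP BY: dedup only
""").fetchall()
print(collapsed)  # [('bounce_merge', 'rpt.bounces')]
```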

Once we've got that fact table, we can start to add some costs.  Notice we're missing a FAILURE_MODE slot on the fact.  There's some more work we need to do on that piece, which I'll start tomorrow. 
   


   



Friday, October 26, 2018

New Posts

I've been trying LinkedIn's article publishing system for a month or so. It's prettier than Blogger - and really anything is prettier than Blogger - and it drops content right onto people's feeds. It's still not the hypertext wonderland I'd hoped it was. I like hypertext. I'd like to point to specific definitions or formulations and build on those, sort of like an encyclopedia. Also LinkedIn clips all cases of two spaces after a period to one, which I don't particularly care about. I don't have a dog in that fight. But it has left me with an inconsistent approach to spaces after the period.

But it works. Lots of views. Here's the two posts:

Semantic Debt: A New Way to think about Data Management Operations

There's only four kinds of failure in data management

Thursday, October 11, 2018

Managing Semantic Debt in your BI Portfolio


A couple of thoughts on how you can manage your portfolio of BI assets using the Semantic Debt framework.

In my last post I talked about calculating the cost of your semantic debt, as the resource cost of bringing the asset up to code.  That is, for any given report, ETL job/pipeline, table set or Data Warehouse, how much would it cost to modify the asset so it does what people want it to do?  This isn't as hard as it sounds, in practice: If you've got say an Attribution report that needs to merge data from two systems but it currently only shows data from one, how much would it cost to get it to merge the two systems?  The cost may include building new ETL jobs and tablesets, some open-ended exploration of the source to figure out the natural keys, whatever.  But most people can get these estimates with not much friction, often right out of their ticketing system.  In fact the better-managed teams I've worked with (I'm looking at you, Shrikant) have their entire portfolio gamed out, often for quarters at a time.

Monday, October 8, 2018

Adding Costs

Next step in our "Why do I care?" discussion is putting some teeth into the analysis: The sharp, serrated, layered-like-sharks-teeth of cost.

Many of my readers work in environments where costs are studiously and purposefully ignored - we're in a post-accountability environment in many organizations right now, where bad design is assumed to be something that gets fixed after the IPO and/or when the adults get hired.  But I think there's still a lot of utility in figuring out how to measure the value of good design in data management.  Most of us don't work for Unicorns, and if you're doing data management you've got an interest in getting a reliable system in place: first so you don't have to wake up at 2AM to fix problems that'll be spotted in the daylight, and second because actual people actually depend on your systems.  This is true whether you're curing cancer or shipping kombucha starter kits, or even curing cancer by shipping kombucha starter kits.

This is the Ideal Case for the use of a semantic asset:


The semantic asset is some discrete package of data that is (a) repeatable and (b) meaningful to its target.
Repeatable: The asset can be reproduced, even if there's a lot of steps or its composition appears to vary in quality from one period to the next. 
Meaningful: The asset has some value to the target.  It may or may not be Actionable!!1!, but there's someone somewhere who's waiting to consume it.
That is, the target knows what the pieces of the asset mean - whether those pieces are structured hierarchically, relationally or whatever.  And the target doesn't have to guess at the asset's meaning each time s/he receives it, because the structure is stable.  An unreliable semantic asset is its own kind of hell, but not one we'll deal with today.

"Meaningful" doesn't mean the asset is simple, and there can be all kinds of rules about dealing with variations in structure, or what variations mean.  But for the asset to be meaningful to the target the target needs to know how to figure out the meaning.  If there are suddenly fields tacked on to the asset that the target hasn't seen before then they literally don't know what they're for.  And if fields are renamed or replaced and the target isn't made aware of the change then once again they literally don't know what they're for.  This might seem like ETL 101, but you'd be surprised how often even supposedly experienced developers don't get this.

In practice this is a matter of communication.  Seems pretty simple, right?  When you change your outputs you should tell the people who use them, and when you get inputs you're obligated to use but can't you should tell the people who produce them.  I once worked for a Supply Chain VP who asked each unit to map their "inputs" and "outputs" in front of the actual people who were their inputs and outputs; these very smart people, used to thinking longitudinally about their responsibilities, suddenly understood the actual scope of their decisions.  BI/DW teams are usually much more aware of their information supply chains, simply because they're much shorter - the teams are usually so small that there's at most two people between a source and the executive using the report.  But even when there's only one person to talk to about the structure and management of a particular semantic asset, things still get screwed up.

In practice this is also a matter of degree.  If you're moving JSON from some kind of document store to a relational db then the entire process is automated with as much pre-formatted input as possible, or you'll never make it work.  Heroism isn't even an option.  If you're sending a spreadsheet to someone in Finance in response to an ad-hoc request, but you've worked with the recipient a lot, then automation and change management is unnecessary.

This is the Actual Case:


It's a little more complicated.  The target gets an asset that isn't what s/he needs.  It may differ from the ideal asset in four ways.  It may also be somewhat more unreliable than s/he needs - it's not as repeatable as spec'd, for example, because sometimes it arrives at 6AM and sometimes at 9AM, but it's critical input into a process that starts at 7AM.  It may not be as meaningful as necessary, too; the asset is missing rows or fields, or sometimes there's a whole presentation layer that seems to have been forgotten, or sometimes a field that's supposed to mean one thing means something else entirely.

So how do we close the gap between What she needs and What she gets?  It's not that hard to figure out how much it would cost.  We can itemize the tasks and get estimates on analysis, ETL and visualization from our BI/DW team.  That tells us what our estimate is, and there aren't many modern BI teams that don't have a fair degree of facility with that kind of estimation.  In fact some teams have gigantic lists of these things, carefully sorted and prioritized.  But that list and its associated estimates are the dollar value of the organization's semantic debt.
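The arithmetic is as simple as it sounds.  A sketch, with invented tasks and invented dollar figures standing in for whatever your ticketing system says:

```python
# Sketch: the backlog of gap-closing tasks, priced from estimates.
# Every task name and number here is made up for illustration.
backlog = [
    {"task": "merge old/new ESP bounce reports",      "estimate_usd": 4800},
    {"task": "add natural-key xref for customers",    "estimate_usd": 7200},
    {"task": "rebuild attribution presentation layer", "estimate_usd": 3000},
]

total_debt = sum(t["estimate_usd"] for t in backlog)
print(f"semantic debt: ${total_debt:,}")  # semantic debt: $15,000
```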

A couple of points.  First, you might wonder which category of debt is more expensive.  There's Equivalence debt and Coverage debt, as you may remember.

Equivalence debt occurs when you've got two assets that need to be merged.  It may be, for example, that What she gets is a report about "Bounced emails by campaign" from the old ESP, and What she needs is a report that merges the Bounced emails reports from both the old and the new ESPs.  The asset might then be in both Type 1 and Type 2 debt; the former if the schemas are different between the two sources, which is virtually certain, and the latter if there's two sets of natural keys between the two as well.

Coverage debt occurs when What she gets lacks the data necessary for the operation it's used for.  Consider an asset consisting of the analysis of log data, fed from a device that seems to be functioning correctly.  The analysis consists of a series of positive steps: Sum these fields, count distinct values in those, transform that field using this xref so it shows these categories, and so on.  Those analytical steps will have some defensive code built in, if they're built by anyone with any experience at all, so e.g. NULLs will get tossed from sums or counts or even xref matches.  All of those rules add up to a model of the log data, a set of assumptions about how the data behaves that help transform the data into something meaningful at the target.  We have coverage debt when there's stuff happening in the log data that isn't accounted for in the model.  It might be as simple as the sudden appearance of negative values in a field where the "sum" operation doesn't expect them, leading to lower-than-expected sums that make the summary suspicious.  Or it might be that values start showing up in category fields that aren't in the xref, and so the asset breaks mysteriously or ignores those values.

So which is more expensive?  It depends on the state of your BI team.  I know "it depends" is the most useless phrase in modern English, but here's how I think it works.  Equivalence debt is easy enough to fix with an experienced developer, especially someone who's got a lot of experience with SQL.  If you've got someone who knows their way around the two sources (or even just one of them) then they can generally work out identity conditions (for Type 2) or schema-equivalence queries (for Type 1) pretty quickly.  Sometimes of course the two sources can't be reconciled, and so everyone chases down a rathole for six months doing something like "the Philosophy of Order Management." (See the beginning of this post for an example.)  But you can, as a BI Manager, probably figure out roughly how long it would take your junior people to resolve a case of Type 1 debt vs. asking the more senior people to do so.

Coverage debt is harder to resolve because it's a different kind of uncertainty.  In the example we used above, of the log data shoved into a summarizing report, you could easily produce an exception report that showed data that was ignored by the first summary; so e.g. strings in the source that don't match strings in the xref.  How expensive would such a report be?  Probably not that much, and it would help you avoid all three failure modes.  That is, there's no immediate failure, because you've covered the requirement and built an open-ended asset for everything not in the requirements.  There's no sudden failure, because the open-ended asset will tell you if something happens in the log data that you didn't expect.  And there's no gradual failure, at least not if you stay on top of folding unexpected results from the exception report into the main summarization.
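The exception report is just the anti-join of the log against the xref.  A sketch, with invented codes and an invented xref:

```python
# Sketch: log values the xref doesn't cover. All names/values invented.
xref = {"OK": "success", "ERR_TIMEOUT": "timeout", "ERR_AUTH": "auth failure"}
log_codes = ["OK", "OK", "ERR_TIMEOUT", "ERR_42", "OK", "EASTER_EGG"]

# The main summary only sees what the model (the xref) accounts for.
covered = [c for c in log_codes if c in xref]

# The open-ended asset: everything the model silently ignored.
exceptions = sorted(set(c for c in log_codes if c not in xref))

print(exceptions)  # ['EASTER_EGG', 'ERR_42'] -- the coverage debt, made visible
```

In SQL this is the same LEFT JOIN ... WHERE xref.key IS NULL pattern; the point is that the exceptions are produced continuously, not discovered during an outage.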

But most coverage debt isn't so easy.  What happens if there's no easy way to accommodate unexpected results?  Maybe that's because there isn't anyone who knows what they mean - i.e. you get values in the log that may or may not be by design, but since there's no one who knows the design no one knows what they mean.  I suspect all device-generated data has this problem eventually; not necessarily because there's new error codes or easter eggs, but because time has a way of making us all operate outside our operating parameters. 

In those cases, the cost is much more open-ended.  You can hire a data architect to design you a customer-event reporting system, for example, but if there are complex or unknown event types in your organization the process is going to be something of a fishing expedition.  Many "Customer 360" programs falter when they start to look at the details of customer behavior, for example.  You know they bought something and you know they bought something again, but all the interactions you've got with them between those two events are locked up in your weblog-reporting systems, or uninstrumented, or god forbid lost to time forever.  Or they live in systems no one has access to, or that are generally ignored - for example, the physical plant of your ecommerce system, which may have various moments of unreliability that explain why "add to cart" trends are so spiky.
   
I think it's generally easier to deal with Type 3 coverage debt - underdetermined schemas - using the open-ended "exception report" method on a single source.  In fact I think it's a good way to turn junior developers into intermediate ones, because it forces the developer to think about what they don't know about the source and the role the source plays within the business.  But when it comes to larger, more diverse sources, or Type 4 debt, the costs to fix it are going to be larger simply because it takes more time and the resolution is not necessarily determinate.

Add up all of these interval costs and you get your total semantic debt:






