With the S-a-T model of semantic assets I pictured in my LinkedIn article, we can start to see the outlines of a data model that would allow us to track semantic assets. Once we can track them, we can start to figure out how much stupid design decisions cost us.
S-a-T is the bare portion of the fact. You need dimensions for source and target. The SOURCE dimension is a location, plus an owner and a file format/device/delivery mechanism and so on. The TARGET is the same kind of dimension.
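To make that concrete, here's a minimal sketch of those two dimensions as tables. The column names are my guesses at the attributes just listed, not a finished design:

```sql
-- Sketch only: column names are placeholders for the attributes above.
CREATE TABLE SOURCE_DIM (
    SOURCE_KEY     INTEGER PRIMARY KEY,
    LOCATION       VARCHAR(255),  -- where the data lives: a schema, file share, endpoint
    OWNER          VARCHAR(255),  -- the team or person responsible
    FILE_FORMAT    VARCHAR(100),  -- file format/device, and so on
    DELIVERY_MECH  VARCHAR(100)   -- how the data gets delivered
);

-- TARGET is the same kind of dimension, so it gets the same shape.
CREATE TABLE TARGET_DIM (
    TARGET_KEY     INTEGER PRIMARY KEY,
    LOCATION       VARCHAR(255),
    OWNER          VARCHAR(255),
    FILE_FORMAT    VARCHAR(100),
    DELIVERY_MECH  VARCHAR(100)
);
```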
Can a SOURCE be a report and a TARGET a human? Why not? We're just as concerned about reports becoming inaccurate for actual real live people as we are about extracts that are no longer meaningful for their stage tables. And since the asset portion of the model is a syntactic representation of the semantic relationship between S and T, there's no reason we couldn't make the asset a Tableau report, say, or an SSRS RDL file, or a package of R code like the kind of thing we see in the new integrated IDEs like Mode.
Did I just say that the asset was a syntactic representation of the semantic relationship between S and T? Yes, I think that's true. Semantic debt is the result of the failure of the syntactic representation - the asset - to satisfy the intended relationship between source and target.
There's a thing the target needs to get out of the source(s). The target asked for it, spec'd it, intends to use it. That thing is supposed to help the target do something: Buy a car or a box of gluten-free low-carb energy bars; call the customers most impacted by a backlog in fulfillment or explain to investors where their cash is going; send a signal to the braking module because the pattern on the sensors means the vehicle is about to hit a deer.
That thing is the asset. A philosopher would say the asset subs in for, or serves the function of, the semantic relationship. Function in the mathematical sense, the way Frege talks about it: The asset is the sense of the relationship the target sets up with the source. The semantic relationship is the signal conveyed between the two. The signal is carried by the asset, which is just a bunch of symbols that, when interpreted correctly, convey that signal.
Okay, enough philosophy. So the S, the T and the a are all dimensions. The a can be one of four types. In fact, lots of assets will be more than one type, which is a practical problem. That's ok, though; we just make ASSET_TYPE and ASSET_DIM into a many-to-many relationship. If you've architected your data management system in such a way that you can reduce each data movement operation to one of the four types, then you get better resolution in your reporting. But if you haven't done that, and you've got (reasonably) complex assets that do more than one kind of thing, you can still fit them into the S-a-T model and do the reporting.
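That many-to-many is the standard bridge-table move. A sketch, with names that are again placeholders (the only type name the model has committed to so far is Equivalence):

```sql
CREATE TABLE ASSET_DIM (
    ASSET_KEY   INTEGER PRIMARY KEY,
    ASSET_NAME  VARCHAR(255)   -- the proc, report, RDL file, R package, etc.
);

CREATE TABLE ASSET_TYPE (
    ASSET_TYPE_KEY  INTEGER PRIMARY KEY,
    TYPE_NAME       VARCHAR(100)  -- one of the four operations, e.g. 'Equivalence'
);

-- The bridge: a complex asset that does two kinds of thing gets two rows here.
CREATE TABLE ASSET_TYPE_BRIDGE (
    ASSET_KEY       INTEGER REFERENCES ASSET_DIM (ASSET_KEY),
    ASSET_TYPE_KEY  INTEGER REFERENCES ASSET_TYPE (ASSET_TYPE_KEY),
    PRIMARY KEY (ASSET_KEY, ASSET_TYPE_KEY)
);
```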
Two points. First, there's this whole question about "reducibility" and "complexity." Some people have three-thousand-line stored procs that move data from one place to another and do lots of crazy voodoo to that data in between. What type of asset is that? Isn't that kind of thing just too complex to fit into a model as simple as S-a-T?
No, not really, because it's still a semantic asset: That script is just a bunch of smaller semantic assets, each of which does one of the four basic operations, combined into one large operation. (And it's not even likely all four operations are involved in your 3000-line script: That particular scenario really only involves two operations, because if you're pulling data out of a defined set of schemas you're doing Equivalence operations exclusively.)
You can easily decompose large ETL scripts or pipelines into component operations by doing some text parsing: take each SQL component, identify whether it does Type 1 or Type 2 stuff, and make those components into atomic facts. (I.e. identify the S in your FROM clauses and the T in your INSERT INTO clauses, where the a is just the query itself.) You'd have a much easier time debugging and managing your pipelines if you did that, obviously. UPDATEs and MERGEs would be part of some other components, so not every SQL block is standalone; sometimes there's more than a couple of irreducible blocks per operation.
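A toy example of what that decomposition yields. Take one block out of the big script, read the T off the INSERT INTO and the S off the FROM, and the block itself is the a. The table names and keys here are made up, and the fact table itself is sketched further down the post:

```sql
-- One component pulled out of the 3000-line script:
--   INSERT INTO stage.orders_clean (order_id, order_total)
--   SELECT order_id, order_total FROM raw.orders;
-- The S is raw.orders, the T is stage.orders_clean, and the a is the query itself.
-- That becomes one atomic row in the fact (all three keys hypothetical):
INSERT INTO SEMANTIC_ASSET_FACT (SOURCE_KEY, ASSET_KEY, TARGET_KEY)
VALUES (
    101,  -- SOURCE_DIM row for raw.orders
    501,  -- ASSET_DIM row for this block of SQL
    201   -- TARGET_DIM row for stage.orders_clean
);
```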
But while there's a lot of value in being able to report on your assets at the atomic level, you can also get a lot of value out of reporting at a higher level. Any organization with one 3000-line pipeline script is likely to have more than one, because 3000-line scripts are a style, not a necessity. That is, if you've got someone who likes to write 3000-line pipeline scripts, then they've probably either infected your whole team or written a bunch of other stuff in a similar vein.
Second, we want to report on these things. There is always a hierarchy of semantic assets, from the individual UPSERT all the way up to the thing your manager calls a pipeline. The end goal is to identify and manage semantic assets in such a way that we can (a) understand what's failed, (b) understand what's likely to fail, and (c) assign cost functions to stupid mistakes.
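One cheap way to get that hierarchy into the model is a parent pointer on ASSET_DIM, so the individual UPSERT rolls up to the thing your manager calls a pipeline. A sketch, assuming a dialect with recursive CTEs (the column name is my invention):

```sql
-- Assumed: each asset can point at the larger asset that contains it.
ALTER TABLE ASSET_DIM ADD COLUMN PARENT_ASSET_KEY INTEGER
    REFERENCES ASSET_DIM (ASSET_KEY);

-- Walk from an atomic asset up through every level it belongs to.
WITH RECURSIVE lineage AS (
    SELECT ASSET_KEY, ASSET_NAME, PARENT_ASSET_KEY
    FROM ASSET_DIM
    WHERE ASSET_KEY = 501            -- the individual UPSERT from the example above
    UNION ALL
    SELECT a.ASSET_KEY, a.ASSET_NAME, a.PARENT_ASSET_KEY
    FROM ASSET_DIM a
    JOIN lineage l ON a.ASSET_KEY = l.PARENT_ASSET_KEY
)
SELECT * FROM lineage;
```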
If you understand your ETL architecture (or pipeline architecture, if you're not 100 years old like me) as a geographical map showing the flow of data from one part of the organization to another, then knowing there's a failure in one flow is a start. You don't need to map facts down to the immutable block of SQL or R. If you've got a large geography to cover, you'd be better off looking at the big map and drilling into specific problem areas before you worry too much about individual components. Think of your organization as analogous to this map:
In your organization you're moving data from the transaction systems, which are managed by a scrum team called "Alberta," to a bunch of other departments, one of which is called "Houston." An individual pipeline from Alberta to Houston has a lot of components and a lot of handoffs. There are still only four operations, or four ways the pipeline can go wrong, if it's moving data. To get to whether we've got more problems with Type 1 than Type 3 we need to get down to a much more atomic level, but we can still track pipelines and their components at higher-order levels.
So while it'd be great to monitor the links between each pumping station, we all know actual ETL design is much chunkier. If you graphed the failures of complexes of semantic assets, you'd see something like this:
(You'd see it with oil & gas pipelines too, although oil & gas lobbyists will tell you pipelines never fail and besides you drive a car so nyah nyah.)
That is, some teams are better at designing semantic assets and stringing them together than others. If Team A's job is to build APIs and Team B is supposed to use those APIs to get data into the data lake, then giving Team A some API management software may help fix their failure rate, but it doesn't help you see the whole problem. As near as I can tell, API management software is mostly a metadata management package with some deployment tools added in. While that's helpful, it's only one piece.
So it's important to understand the context an individual semantic asset operates in, too. We need to know the process the asset takes part in. "Process" is a proper name here, and it's likely to change over time. The processes used in order management now may not be the same ones used in three years; the pipelines we use to calculate error rates on our experimental device data will not be the same ones we use when those calculations are used in production devices. That's ok, because more than likely - much more often than not? - we'll reuse the atomic semantic assets in the new processes. We may get a new order management manager, for example, who wants us to deal with order backlogs in a different way. But we still need to run the same query we used to. I mean, that piece doesn't change, sometimes ever.
So we need a Process dimension as well. Here's a sketch of the star schema:
(Unlike most of the other graphics I've posted on this blog, this one is designed to be embiggened. If you click on it the colors won't go all weird. I also understand there's potentially a lot of value, monetary or otherwise, posted here. I think this is a data model pattern, like the PARTY model, so there's not much point in keeping it hidden. Also, no one is paying me for anything right now, but you got to spend money to make money.)
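For anyone who'd rather read DDL than a diagram, here's roughly what that sketch amounts to in text. This is my minimal rendering of the star, assuming only the dimensions discussed so far, so treat it as the pattern rather than the finished model:

```sql
CREATE TABLE PROCESS_DIM (
    PROCESS_KEY   INTEGER PRIMARY KEY,
    PROCESS_NAME  VARCHAR(255)   -- "Order Management", etc.; expected to change over time
);

-- The fact: one row per atomic S-a-T movement, keyed to all four dimensions.
CREATE TABLE SEMANTIC_ASSET_FACT (
    SOURCE_KEY   INTEGER REFERENCES SOURCE_DIM (SOURCE_KEY),
    ASSET_KEY    INTEGER REFERENCES ASSET_DIM (ASSET_KEY),
    TARGET_KEY   INTEGER REFERENCES TARGET_DIM (TARGET_KEY),
    PROCESS_KEY  INTEGER REFERENCES PROCESS_DIM (PROCESS_KEY)
    -- no FAILURE_MODE yet; that slot is deliberately missing (see the end of the post)
);
```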
If we load this up with our assets, at whatever level in the hierarchy you can get, we'll start to see some patterns. Don't worry too much about filling the process in right away, because you can discover it.
For example, for almost every component that isn't an ultimate end-point target or a straight-from-the-customer-click source, there's a chain of assets. Do a loop on the fact table and, for every source that's some other fact's target, join the two facts, like this:
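Something like this self-join-turned-recursive-CTE is what I have in mind. It assumes you can match a target to a source on LOCATION; in practice you'd match on whatever identifies the same place, and the starting location is just an example:

```sql
-- Chain facts where one fact's target is another fact's source.
WITH RECURSIVE chain AS (
    SELECT f.ASSET_KEY, f.SOURCE_KEY, f.TARGET_KEY, 1 AS depth
    FROM SEMANTIC_ASSET_FACT f
    JOIN SOURCE_DIM s ON s.SOURCE_KEY = f.SOURCE_KEY
    WHERE s.LOCATION = 'raw.orders'          -- pick an upstream starting point
    UNION ALL
    SELECT f2.ASSET_KEY, f2.SOURCE_KEY, f2.TARGET_KEY, c.depth + 1
    FROM chain c
    JOIN TARGET_DIM t  ON t.TARGET_KEY = c.TARGET_KEY
    JOIN SOURCE_DIM s2 ON s2.LOCATION  = t.LOCATION   -- target of one hop is source of the next
    JOIN SEMANTIC_ASSET_FACT f2 ON f2.SOURCE_KEY = s2.SOURCE_KEY
    WHERE c.depth < 20                        -- crude guard against cycles
)
SELECT * FROM chain;
```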
That's an organic "process." Some components of it may be managed by e.g. "Order Management" or "the Madison QA office," in which case there's a hierarchy of responsibility for the process and its subprocesses. Which is good, because now we can start to figure out who owns the mistakes we see - who owns the pipeline failures. And given that virtually everyone puts their stuff into repositories of various sorts now, it shouldn't be hard to get the facts together and then tie them into a long chain.
The astute reader may have noticed that I've only got one S on my fact, but many assets have more than one S attached to them. Ah yes, very astute. You can handle it in one of two ways: Each S gets its own line, or multiple Ss associated with a single asset are clumped together and treated as immutable. I kind of prefer the latter, and then later in the day I kind of prefer the former. I need to think through which is more valuable from a reporting standpoint. But this is a "cancellation of terms"-type problem in the SQL you'd use to put together the organic process - a non-aggregating GROUP BY gets rid of the duplicate assets - so either, I think, will work for now.
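For the record, the "each S gets its own line" option plus that non-aggregating GROUP BY looks something like this, against the fact table sketched above:

```sql
-- Under option one, an asset with three sources shows up as three fact rows.
-- When you're chaining facts into a process you only care about the hop itself,
-- so group the Ss away without aggregating anything:
SELECT ASSET_KEY, TARGET_KEY
FROM SEMANTIC_ASSET_FACT
GROUP BY ASSET_KEY, TARGET_KEY;  -- the duplicate assets cancel out
```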
Once we've got that fact table, we can start to add some costs. Notice we're missing a FAILURE_MODE slot on the fact. There's some more work we need to do on that piece, which I'll start tomorrow.