Wednesday, November 7, 2018

Building the Semantic Fact for your basic Fortune 1000 company

We've got the sketch of a data model here.  The model will get some context in this post, and we'll fill out a little more of what's in the dimensions and how to identify facts.  Once we've got some idea of how to generate facts we can start to add some costs and infer some behavior.  And then we can tell our bosses why something will work and why it won't.

So let's work through some implementation questions.  (I will try to mix as few metaphors as possible.)  There are two big questions - or maybe two-and-a-half.

1. How do we get a list of semantic assets?
2. How do we know an asset has failed?



How do we get a list of semantic assets?  It's relatively straightforward to do this if you've got a single DW system, which probably doesn't have more than a single ETL/pipeline management system between it and its sources.  Let's walk through the most common use case:
  1. DW: Suppose a single target corporate reporting database, usually with a consolidated data lake-type model.  It may be a mess or it may not, but it's probably gotten better in the last few years if anyone has paid any attention to it.
  2. Core TXN: Suppose one centralized ERP-type system for managing transactions (or "txn" for short), which might also be the org's core homegrown product-selling transaction system. This system may be a thing of multinational beauty managed by a handful of geniuses, or it may not.  They may even just be lucky.  
  3. Core CRM: Suppose a CRM of some sort, likely Salesforce but also possibly attached to the ERP system if the org is, say, an SAP, Axapta or Oracle shop.  It may be used by a small sales force to track deals or an enormous customer-service team to track complaints, or both.
  4. Core Digital Comms: Suppose at least one ESP (email service provider).  This might be attached to the CRM, as in e.g. Eloqua or Responsys.  It might be standalone.  Many orgs have a weakness for multiple varieties.
  5. Core Digital Collector: Suppose at least one kind of digital analytics collector.  Google Analytics doesn't count because you can't extract the data into the system in (1). ;)
  6. Reporting: Assume there's a corporate BI team and they have a preferred analytics or reporting tool.  Whether they use Tableau or they've got SAP and rely on a universe-based system like BO, ultimately this thing will have a centralized box.  SSRS is also something you can get data from pretty easily.  There will also be a lot of Excel and PowerBI use, no matter where you go.  There may also be reports of some sort created by a development team for internal software-team use or external customer consumption.
  7. Movement: Movement of data between systems will be handled in at least one of two ways.  
    1. First there will be some formal, more-or-less off-the-shelf tool like Informatica, SSIS or Datastage in use by some portion of the organization.  
    2. Some other portion of the organization will use Python to move data (or something in a similar vein, possibly Java - yes, I know, "JAVA?"  But I've seen it).
    3. It's a fair bet that neither group will be aware of the other or, if they are, may feel that what the other is doing is actually harmful.  It will almost always be the team with the OTS tools that provides data to the DW, and may even manage the DW.  The Python team may extract data from the DW and its sources and put it into a cloud-based analytical system or Hadoop or one of the newer semantically-amorphous tools like Splunk for use in a "customer analytics" program.
      1. The former group will call what it does ETL.  When ETL is built there are one or more guys who specialize in it, who are "drivers" of the tool, and they will build their ETL packages for long-term maintenance by other ETL people.
      2. The latter group will call what it builds "pipelines."  When pipelines are built it's usually one task among many on someone's list, they may or may not have built one before, and it will be architecturally indistinguishable from something disposable.
      3. A third group may take care of APIs.  APIs are as robustly designed and managed as ETL, built by specialists like the people in 7.3.1, but written in the languages and style of 7.3.2.
  8. Requirements: Finally, there's some combination of project management/program management/scrum/whatever institutions the organization uses to make things happen.  These institutions support the requirements-gathering, analysis, prioritization and tracking process.  There will in practice be several of these, and they fall into two categories:
    1. Formal groups, with budgets and procedures and everything.  Because they have a budget their mandates will necessarily be limited.  They may be focused entirely on feature updates to (2) and (3).  They may think that anything that comes out of (2) or (3) is also in their purview and thus take on some institutional role in the population of (1) and the use of (7).  They may be easy to work with or hard to work with.
    2. Informal groups.  These are usually networks of relationships between the various user communities covering (1) through (7).  They're designed to pass requirements from target to source.  Orgs that lack effective formal groups like the kind in 8.1 will have lots of informal groups.  The data will find a way through, eventually, and 8.2 groups are the org's metabolic response to the problem of data.
    3. Either type of group may use tracking software like Jira or Trello or MS Project or even Basecamp, or may even think Slack is their project tracking software.  It will not be easy to get this data.
That's your basic modern internet-enabled first-half-of-the-21st-century corporate data architecture, in the wild.  There are eight components that generate metadata.  A few of them can be grouped as sources of data.  A few of them are targets.  Two are responsible for getting data to move from sources into targets: one of those functions is Movement, and the other is Requirements.
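If it helps to see that inventory as data, here's a minimal Python sketch of the component list - all names hypothetical - with the data-holding components tagged by layer and the two cross-cutting functions tagged separately:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Layer(Enum):
        SRC = "source"       # components (2) through (5)
        DW = "warehouse"     # component (1)
        BI = "reporting"     # component (6)

    class Function(Enum):
        MV = "movement"      # component (7): ETL, pipelines, APIs
        RQ = "requirements"  # component (8): formal and informal groups

    @dataclass
    class Component:
        name: str
        layer: Optional[Layer] = None        # data-holding components live in a layer
        function: Optional[Function] = None  # MV and RQ cut across all layers

    INVENTORY = [
        Component("DW", layer=Layer.DW),                       # (1)
        Component("Core TXN", layer=Layer.SRC),                # (2)
        Component("Core CRM", layer=Layer.SRC),                # (3)
        Component("Core Digital Comms", layer=Layer.SRC),      # (4)
        Component("Core Digital Collector", layer=Layer.SRC),  # (5)
        Component("Reporting", layer=Layer.BI),                # (6)
        Component("Movement", function=Function.MV),           # (7)
        Component("Requirements", function=Function.RQ),       # (8)
    ]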

It looks a bit like this:

[Architecture diagram - click to enlarge: the SRC, DW and BI layers, with the MV and RQ columns alongside.]

I think it's worth asking a question about this diagram before I explain it: Which dimension(s) do you think use generic icons?  Your answer(s) will tell you where you probably need to do some digging.

We map our use-case into the architecture as follows:
  1. All the source systems end up in the SRC layer.  That includes (2) through (5).
  2. The DW in (1) is the DW layer.
  3. Reporting in (6) is the BI layer.
  4. The Movement function is in the MV column.  It can see all three layers.
  5. The Requirements function is in the RQ column.  It can also see all three layers.
There will be some variations in instances of the use case:
  1. More than one ESP or web analytics environment.  (Again, Google Analytics doesn't count because you can't extract any raw data.)
  2. More than one Core TXN or CRM, particularly if the company has just gone through a merger.
  3. Prevalent use of Jira, Tidal, Jenkins, Nagios, even ANT or some other operational automation and management tools, each of which might possibly theoretically (maybe) generate useful metadata about internal processes
  4. More than one reporting tool.
  5. More than one DW.  
    1. Often there's a customer-experience focused software team that wants to build its own event-logging system.  They'll be desperate to avoid the burden of a consistent semantics, so it'll be little more than a petabyte-scale sandbox full of abandoned toys, if it's any age at all.
    2. But there may be less messy DWs built for marketing, customer service, finance and FP&A groups, supply chain, even HR.  These will be of varying quality.  Some will be interlocking conformed star schemas that look like they were built by a spider on too much CBD.  Some will be simple data lakes or local collections of what used to be called "supertxns" or large denormalized extracts.  Many marketing DWs are simply a thousand IS_A_ fields glued onto a set of customer attributes willy-nilly, because they can be.
Nothing about these variations really adds much complexity to the task of getting assets except the Excel component, mentioned in the Reporting bullet above.  My advice is to avoid any part of the landscape that uses Excel, as long as you can.  If you want to look proactive, identify a very few critical spreadsheets and apply The Scotty Principle to the remainder of the Excel swamp.  Most experienced people would agree this is good advice: Exploration of the Excel swamp is for interns.




You may look at our architecture diagram and think it's too simple.  It's too "thin," as the philosophers say.  Not so: I never met a thin concept I couldn't make into a thick one.  (Indeed, some of what I'm working on here is a way of showing that all thin concepts are actually thick ones.  It's a kind of in-joke shared with former grad-school colleagues who'd never read something like this.)

So instead think of it this way:



These are three separate process flows to get someone a Marketing attribution report.  First we get the data out of our Core TXN and CRM systems, merge them in the DW, and move an extract of that into our BI layer for use elsewhere.  Second, we move that really useful merge-extract asset that Jose in Marketing discovered into the Digital Communications application.  Maybe this thing is as simple as providing an XREF between customer keys in the CRM and customer keys in the TXN system, for consumption by the website.  Or maybe it's also orders or customer complaints or zip codes or prefectures; whatever it is, Jose really likes it and he'd like to see it in use in the digital marketing suite.  The third process is getting the digital marketing suite to output an attribution report - this is what Jose was looking for all along, but he had to go through the first two processes to get to the third.
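Here's a sketch of those three flows as plain source->asset->target triples, in Python; the system and asset names are made up for illustration:

    # Each step is a (source, asset, target) triple - the S->a->T structure
    # discussed below.  All names here are illustrative, not from a real org.
    FLOW_1 = [  # merge TXN and CRM in the DW, extract to BI
        ("Core TXN", "txn_extract", "DW"),
        ("Core CRM", "crm_extract", "DW"),
        ("DW", "merge_extract", "BI"),
    ]
    FLOW_2 = [  # push Jose's merge-extract out to digital comms
        ("DW", "merge_extract", "Core Digital Comms"),
    ]
    FLOW_3 = [  # the attribution report Jose wanted all along
        ("Core Digital Comms", "attribution_report", "BI"),
    ]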

Notice that for each process there's a bunch of lines into the MV and RQ functions.  Something has to move the data from SRC to DW to BI, and back again into e.g. the digital comms systems.  Sometimes that's formally in the MV function, but sometimes the SRC system can take care of itself.  There's still a script in there somewhere, but good luck getting it.

Someone also has to request the data be moved.  This is where in any organization the whole system falls apart, obviously, because these are the people who manage the expectations, the intent, the meaning behind the whole thing.  For any given semantic asset, if you don't know what the ask is and why it's there, it's impossible to know, except by pure force of educated speculation, what the point of the asset is.  The ask may be stored in a long, long set of comment fields in Jira.  It may be that Jose's attribution report lives on long after he's moved on to bigger and better things, but everyone knows what it means implicitly and it's become part of the oral tradition in Marketing, the modern equivalent of ancient Greeks reciting passages from Homer.  Or frankly it could be the nasty scrawls of a busy oncologist in a comments field, ordering up a care plan.

In any event, this is the point on the diagram where the meaning comes in.  If you think that's hilarious, because no one at your organization would even think in those terms and you certainly don't do a great job of documenting the Ask, then you've probably got a big problem.  You're probably closer to ancient Greece than you think.

Another way to look at this use-case:


Hey, some of the layers are reused!  Well, isn't that amazing.  Looks a bit like this:


I call this the "Flag View" because you can map the three key parts of the semantic asset creation process into it.  The piece in the middle is immutable, as are the two bookends, like these two:



That is, even if you don't know your atomic units - your atomic sources or targets or assets - you can still treat the whole as a component of the fact because they're fractal. 
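One way to read "fractal" here, as a minimal Python sketch (names hypothetical): a composite asset looks exactly like an atomic one from the outside, so the fact can hold either without caring which it has.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SemanticAsset:
        """A source->target fact slot; composite or atomic, same shape."""
        source: str
        target: str
        name: str
        children: List["SemanticAsset"] = field(default_factory=list)

    # The whole middle of the flag can sit in the fact as one asset,
    # even before anyone decomposes it into its atomic pieces:
    middle = SemanticAsset("Core TXN", "BI", "attribution_flow", children=[
        SemanticAsset("Core TXN", "DW", "txn_extract"),
        SemanticAsset("DW", "BI", "merge_extract"),
    ])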




Many people I've worked with will remember that I think the simplest definition of a FACT is who bought what from where, when did they do it, and for how much?  In the Canada of my youth this was known as "the 5 Ws," which is to say who, what, when, where, why & how.

What we're doing here is filling in the slots in the fact, the variables in our "5 W" question.  In this case our fact is defined by the variables in the diagram above.  That's the mystery of data warehouse modeling, for the star schema at least.  Build systems that can fill in the gaps on the 5 Ws for any given process, and you've got the basics of a data warehouse.  Anywhere.  This is one reason data warehouse architectures are particularly suited to quality management regimes, like Six Sigma.
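As a concrete sketch of those slots, here's the fact with its 5 Ws as fields in Python - field and key names are hypothetical; filling them in for a given process is the modeling work:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class Fact:
        """Who bought what from where, when did they do it, for how much."""
        who: str                   # customer dimension key
        what: str                  # product dimension key
        where: str                 # location/channel dimension key
        when: datetime             # transaction timestamp
        how_much: float            # the measure
        why: Optional[str] = None  # the Ask, if anyone wrote it down

    sale = Fact(who="cust_042", what="sku_99", where="web",
                when=datetime(2018, 11, 7), how_much=19.99)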

At a high level to get a list of assets we need to get the extracts, with their sources and targets.  We also need to know how those extracts are executed, through the MV function, and what the Ask was, via the RQ function.  We're going to collect those data points so they match our architecture and stack them, like this:



Sometimes those assets will have owners or processes that can be identified.  Sometimes they'll have more detail in the MV or RQ functions, or they may just be gigantic blocks of stuff no one can figure out.  (That's why you've got that super thick fact in the middle of the stack: It does a whole bunch of work and no one has had time to decompose it into more atomic thin facts.)

In practical fact the easiest place to get all of this data is out of the ETL tools and the pipelines.  Even if the pipelines are disorganized, use multiple coding styles, conventions or languages, and are of varying quality, all you need to look for is that S->a->T structure.  You need to identify a source, a target, and make the extract the asset.  If there are multiple steps in the asset - if all you've got is a TXN and a CRM for a source, an attribution report for the end, and a huge ten-year-old Informatica or SSIS package as the asset - then that's what you've got.  You should consider that an example of a thick fact, as in the stack diagram above.

And, alternatively, if you aren't using a formal, well-documented ETL tool for your repository then you're probably in some variation of the pipeline system anyway.  Even when I've managed ETL teams that used SSIS with immense amounts of governance pressure, the queries at the hard center of each SSIS package used different styles, the property-naming conventions were all over the place, and the use of variables ranged widely - it's just a fact that in even the harshest Big Brother governance regime people will do things differently.

In which case you're still better off writing a set of tools to extract the S->a->T structure out of your scripts, XML variants, and relational repositories. 
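As a toy example of what such a tool looks like, here's a Python sketch that uses two regexes to pull (source, asset, target) triples out of a SQL script.  A real version would need a proper SQL parser plus a reader for each repository format, but the shape is the same:

    import re

    # Find INSERT INTO ... SELECT ... FROM pairs; the script is the asset.
    TARGET_RE = re.compile(r"INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
    SOURCE_RE = re.compile(r"FROM\s+([\w.]+)", re.IGNORECASE)

    def extract_sat(script_name, sql_text):
        targets = TARGET_RE.findall(sql_text)
        sources = SOURCE_RE.findall(sql_text)
        # Pair every source with every target: the S->a->T triples.
        return [(s, script_name, t) for t in targets for s in sources]

    sql = """
    INSERT INTO dw.customer_merge
    SELECT * FROM src.crm_customers
    """
    print(extract_sat("merge_extract.sql", sql))
    # [('src.crm_customers', 'merge_extract.sql', 'dw.customer_merge')]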

Where do you get the request data from?  Wherever you can.  At this stage of the game if you can identify a business or functional owner for the semantic asset - the person who asked for someone to build the attribution report - then you make them the RQ.  But you may not be able to do that.

Which is ok.  We'll need to fill these things in later.

Next we'll think about how we know when a semantic asset has failed.
