Here's how our discussion will go:
- We'll talk through a working example.
- We'll talk about how data migration generally works.
- A recap.
- Type 1 debt.
- Type 2 debt.
- Equivalence failures in general.
- Type 3 debt.
- Type 4 debt.
- Coverage failures in general.
- Why are there only four types of debt?
- The end.
If you've got a lot of experience with data management you may already think these four types are obvious. You may even think there's more types, in which case please leave a comment. After you leave your comment you can skip this post.
Everyone else who's interested should sit down, because this is a little denser than usual.
Our working example
I think there's four types of debt for empirical reasons. The relevant empirical experience here is in data migration, which tends to:
- be every data manager's least favorite task, and
- surface all the hidden and implicit assumptions, shortcuts and ideological choices that went into the systems, what you might call the semantics of the system.
And also: Formal data migration processes are intended specifically to fix the four kinds of debt even though no one currently says that's what they're doing.
When we do data migration we try as much as possible to "fix" the systems we're migrating, which is to say we do our best to get rid of the ad-hoc add-ons and integrate all the parts as fluidly and naturally as we can. As much as possible we try to avoid scenarios where we've got kludgey "glue" that holds stuff together. We make these fixes both in terms of the database models and the content schemes, which in past posts I've just called the "data models" for short.
Consider, for example, the bog-standard process of migrating two legacy ERP systems into a third, newer ERP system. A common case here is an organization that's long used an ERP system (let's call it O1) and that's recently acquired another large organization with a long tenure, one that had its own ERP system (let's call it O2). O1 will inevitably have been customized, as will O2, to meet the preferences of each set of business users. Those customizations will be seen throughout both organizations, from finance to supply chain to HR, and each modification is made to the ERP system's original data model and content schema.
The data model in these ERP systems is the thing you think it is: The database schema in whatever persistence mechanism the ERP system uses. (Even an ERP based on a NoSQL system has a database schema, obviously.) There will be a header/detail/item tableset for orders, for example. Also likely a ledger/ledgerjournal/ledgerjournaltrans structure for tracking AR and AP transactions, and some variation of the party model for customers. And so on.
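If it helps to see it laid out, here's a minimal sketch of the kind of tableset I mean. The table and column names are invented for illustration; a real ERP schema runs to thousands of tables, and the ledger and party structures are omitted.

```python
import sqlite3

# A hypothetical, heavily simplified order tableset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_header (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL,
    order_status TEXT NOT NULL,    -- value drawn from a content scheme
    created_at   TEXT NOT NULL
);
CREATE TABLE order_detail (
    order_id     INTEGER REFERENCES order_header(order_id),
    line_number  INTEGER NOT NULL,
    item_id      INTEGER NOT NULL,
    quantity     INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_number)
);
-- Part of the content schema: the lookup holding the legal order statuses.
CREATE TABLE order_status_lookup (
    status_code  TEXT PRIMARY KEY  -- e.g. NEW, BOOKED, BILLED, ...
);
""")
```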
There's also a large content schema. (Subsets of the overall schema I'll call "schemes.") Some of this content is just the values of the lookups used to support the ERP system's forms: Dropdowns for zip codes or postal codes, three-digit or two-digit country codes, address types (e.g. HOME and WORK), etc.
Other content schemes are critical to the operation of the ERP system, and don't just help populate forms. For example, the ERP system may store its order statuses in a lookup, and use a near-generic set of fields in the model to store the current status of an order. In that case the header table, for example, will contain a field called ORDER_STATUS that takes various values, e.g. {NEW, BOOKED, BILLED,...}. Sometimes an individual content scheme is stored as a lookup in the database and sometimes it's functionally a list of values stored in actual software code, or what's known as "enums." (Enums, as I've mentioned, are very very bad.) But the ERP system will do things to e.g. an order based on its order status, and so if the default list of order statuses changes - if the order status scheme is customized - then some code must be written so that the order management modules can take advantage of the custom statuses. This customization likely impacts reports in the reporting layer, too.
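Here's the lookup-versus-enum distinction in miniature, with a hypothetical status scheme. The point is that the lookup can be extended with an INSERT, while the enum can only change with a software release.

```python
from enum import Enum

# The content scheme stored as data: extendable without touching the code.
ORDER_STATUS_LOOKUP = {"NEW", "BOOKED", "BILLED", "SHIPPED", "DELIVERED"}

# The same scheme hard-coded as an enum: customizing the statuses now
# means changing and redeploying software (and fixing whatever reads it).
class OrderStatus(Enum):
    NEW = "NEW"
    BOOKED = "BOOKED"
    BILLED = "BILLED"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"

def is_valid_status(value: str) -> bool:
    # With the lookup, a custom status added by the business just works;
    # with the enum, nothing works until someone ships new code.
    return value in ORDER_STATUS_LOOKUP
```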
(Other times the status of an order is stored as a timestamp in a fixed set of fields. In the latter case the header table contains a field called NEW, and another called BOOKED, and so on, and if that field's timestamp is later than all the rest then that field's status is the current order status. This just means the content scheme is reflected in the data model, or vice-versa, which is why I don't really distinguish them logically. Now this latter approach may appear to cause all kinds of problems and so: Who would actually do that? But as the kids say, don't @ me. Of course it causes problems. So does the other one. Data models satisfy many wondrous niches.)
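A sketch of that second pattern, with invented field names: one timestamp column per status, and the current status is whichever column holds the latest timestamp.

```python
from datetime import datetime
from typing import Optional

# Hypothetical header record in the "one timestamp column per status" style.
header = {
    "NEW":       datetime(2024, 3, 1, 9, 0),
    "BOOKED":    datetime(2024, 3, 1, 14, 30),
    "BILLED":    None,
    "SHIPPED":   None,
    "DELIVERED": None,
}

def current_status(record: dict) -> Optional[str]:
    # The current status is the column with the most recent timestamp.
    stamped = {k: v for k, v in record.items() if v is not None}
    return max(stamped, key=stamped.get) if stamped else None

print(current_status(header))  # BOOKED
```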
One of the most critical content schemes in any ERP system is its chart of accounts or COA. A COA is a list of all the accounting buckets used by the organization. COAs ought to be relatively uniform, you'd think, and in general they are: Orgs pay for similar things out of similar accounts, whether they're based in Kyoto, Malibu or Djibouti. But in practice they are only semantically similar, which is to say that if you put three comptrollers together in a room with their individual COAs they could find, more or less, the same things in the other COAs as they find in their own. In practice though two COAs are never logically isomorphic, which is to say you cannot map one COA perfectly into another. The semantic equivalence is at the level of fuzzy super-categories within the COA, in that everyone needs to pay vendors and employees and get paid by customers, because the basic principles of accounting are the same across each enterprise. COAs differ in part because of local considerations: You categorize vendors differently in Malibu than you do in Djibouti, Kyoto has different legal reporting requirements from Djibouti, and inventory issues in Kyoto are handled differently in Malibu.
You also have the personal preferences and style of each comptroller. Some comptrollers want to be COA Architects, and others use what their favorite boss used or copy something out of a textbook. Even then some comptrollers prefer to track certain accounts with more fine-grained subcategories than others, because they feel it gives them a better handle on the business. Others choose to change their COAs regularly, to adapt to changing business scenarios, obfuscate certain transaction histories, or simply because it keeps their people on their analytical toes.
I've picked ERP migrations as an example here for two reasons. First, they're extraordinarily common, even in this age of DIY/Agile/everything-into-the-MongoDB startups, so almost everyone has some experience with them even if they're not conscious of it. Second, they're also fairly well-behaved, intentionally. ERP systems are designed to be a series of software templates that, on implementation, force the business to adapt to the software's conception of the business model; in practice of course there's a lot of customization. But even if an organization uses a home-grown system as its core revenue-generating transaction engine - so it's not an Oracle or SAP ERP system, for example - and it's a DIY/Agile/everything-into-the-MongoDB unicorn, that core transaction engine will still consist of those two main parts. There will be a database schema and a content schema. The former stores at least the bare-bones or structure of the transaction. The latter provides at least the meaning of the scoped parts of the transaction, or the values of the scoped variables in the transaction formula.
The migration process
So let's go back to O1 and O2. Let's suppose we want to try to move the data in both into a third system, O3. This is actually easier than trying to merge O1 & O2, although perhaps only from an emotional/conflict-resolution standpoint. The logical process (or planning schema!) for migrating data from O1 & O2 to O3 is something like the following:
1. Lay all the models and content schema out on a table.
2. Build a mapping function, including any transformations, reverse engineering O3 from the models and content of O1 & O2.
   - Transformations include any simple string operations - truncation, concatenation, etc. - that need to be done to fit the source data into the target.
   - Transformations include any aggregation or dis-aggregation that needs to be done on source data to get it into the target.
3. Run the migration process and see what doesn't fit.
4. Evaluate the fallout from (3) and determine whether it's critical for business operations.
   - If it is, then because it doesn't fit you need to go back to (2).
   - If it isn't, then you add it to the list of future priorities with an appropriate "to be fixed" priority.
When you've reached a satisfactory state for (4), you're done. If you add some time estimates onto each piece and apply some Agile guesstimation rules to each iteration, it seems pretty simple, huh?
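In code the happy path really is that simple. Here's a toy sketch of steps (2) through (4), using a hypothetical country-code transformation to stand in for the hundreds of mapping rules a real migration carries:

```python
# Step (2): the mapping function, including transformations. The map and
# field names are invented; real ones are reverse-engineered from O1 & O2.
COUNTRY_MAP = {"USA": "US", "CAN": "CA", "JPN": "JP"}

def run_migration(source_rows):
    migrated, fallout = [], []
    for row in source_rows:
        code = COUNTRY_MAP.get(row["country_code"])
        if code is None:
            fallout.append(row)  # step (3): it doesn't fit
            continue
        migrated.append({**row,
                         "country_code": code,
                         "customer_name": row["customer_name"][:40]})  # truncation
    return migrated, fallout

rows = [{"customer_name": "Acme Widgets", "country_code": "USA"},
        {"customer_name": "Djibouti Telecom", "country_code": "DJI"}]
ok, fallout = run_migration(rows)
# Step (4): evaluate the fallout. If it's critical, go back to (2);
# otherwise file it under "to be fixed."
print(len(ok), "migrated,", len(fallout), "for review")
```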
Well, except that you're not done. O1 & 2 will both have something of an accretion disc of reports, ETL processes, exports to other systems in and out of the org, custom-built forms and home-grown applications, and a nebulous cloud of semantic expectations surrounding the ERP system. All of these semantic assets also need to be sifted for value and a migration status determined.
In some cases these semantic assets will assist the migration: They can tell you what the original model or content scheme means to the business operationally, and allow you to prioritize accordingly. During the migration they can cover for certain non-essential functions, and provide a proof point to benchmark against. The latter requires they be semantically distinct from the main ERP system.
An ODS and other semantic reorganizations of an enterprise's data are examples of these semantically distinct assets. Data lakes, being mere co-located copies of source tables and schemes, are not semantically distinct assets, and a data lake is just a giant slug of inert and unhelpful semantic debt during a migration. (After the migration, you can and should just throw it away.) Note that if you personally have a "data lake" that isn't just copies of source tables because you reorganize the source data into new tables and consistent schemes that people report against, then you don't have a data lake. You have the beginnings of an ODS, which is a good thing.
Most of the time however these assets just need to be migrated. In all cases these assets were built to compensate for the delta between what the data model or content scheme does and what the org actually needs. That is, the assets are there because there's a semantic debt created in the gap between the ERP as implemented and the business process, and the asset tries to close that gap.
To understand why this is, you just need to consider basic everyday business processes. We type stuff into a form on top of our ERP or import files into a table over the course of many days and eventually come to realize we need to look at the data we've typed and imported in a way not envisioned by the ERP's designers and implementors. Maybe we need a late-order report. Not a backorder report, which comes standard with the ERP system, or an overdue-order report, which is also standard, but a report that allows us to specify, for some group of customers, which orders should be shipped tomorrow but are today in a status that's not "ready-to-be-shipped-tomorrow." We want this report so we can call the customers and tell them their orders are late, or so we can yell at the guys in the warehouse, or whatever. But we need a report that isn't in the stock of reports that comes with the ERP system. So we ask our BI team to build us one, a late-order report.
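Something like the following sketch is what the BI team ships. The field names, statuses and dates are all invented; the point is that the report is just a filter the stock reports don't provide.

```python
from datetime import date, timedelta

READY = {"READY_TO_SHIP", "SHIPPED"}  # hypothetical "on track" statuses

def late_order_report(orders, customers):
    # Orders for the named customers that should ship tomorrow but
    # aren't yet in a ready-to-ship status.
    tomorrow = date.today() + timedelta(days=1)
    return [o for o in orders
            if o["customer"] in customers
            and o["promised_ship_date"] == tomorrow
            and o["status"] not in READY]

orders = [
    {"order_id": 101, "customer": "BigBox Retail", "status": "PICKING",
     "promised_ship_date": date.today() + timedelta(days=1)},
    {"order_id": 102, "customer": "BigBox Retail", "status": "READY_TO_SHIP",
     "promised_ship_date": date.today() + timedelta(days=1)},
]
for o in late_order_report(orders, {"BigBox Retail"}):
    print(o["order_id"], o["customer"], o["status"])  # order 101 is late
```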
That report is a semantic asset. Now a little later on the guys in the warehouse get tired of us yelling at them and ask for the report to be emailed to them every morning. They hire an intern to build a Python script that takes the pick list they get from the ERP and merges it with the late-order report so they get a better prioritization. This is a fantastic innovation in that future inventory turns and generations of employees come to rely on this late-order report all the time, implicitly - which is to say, unconsciously, they're not even aware they're using the asset. But this new report is also a semantic asset built on top of the previous asset, with some other assets besides. It adds a ton of value to the org, and VPs tell their management that they've reduced customer complaints through some magic and everyone is happy.
And something like that asset needs to be migrated to O3. The late-order report is a semantic asset that either needs to be available in O3, or built with O3 and the other tools we have available to us. If the report is not in O3 then there's a semantic debt: There's a difference between the data models (database + content) and the business model. If it's not migrated we'll suddenly be back in the position we were in before we had a late-order report, which was gigantic retailers complaining they didn't get their widgets for Black Friday.
So when we look at our migration process, it's not just that we need to map O1 & O2 into O3. We need to look at the halo of semantic assets around O1 & O2 and figure out what assets come natively in O3 and what assets must be built from scratch or migrated. And we need to make sure they're the same, or close enough that people in the org can still do their work, because those assets play a critical day-to-day role in someone's life. Losing the asset is tantamount to losing your car keys: It's not that you can't get by without a car in the Big Picture, but it sure puts a crimp in your daily process.
When we look at how we structure step (2) above and perform the evaluation in (4) above, for both the two source systems and their halos of semantic assets, we run into four basic kinds of issues. These are our four types of debt.
Let's recap
Still with me? Let's recap.
1. Data migration is the best place to study semantic debt.
2. Data migration means mapping one set of data models and content schema - the tables and the lookups - into another.
3. There's a pretty straightforward process for doing that.
4. We also have to migrate all the stuff that's been built around those original data models and content schema from the old system to the new one.
5. The stuff in (4) is how the old system worked within the business, and the organization needs the same stuff to keep running.
6. That's where you find all the semantic debt, and there's only four kinds of it.
So what are the four kinds of problem we run into?
Type 1 debt: Incompatible schemas
First, we can run into a scenario where the models or content schemes are different for the same process. I've used the example several times of two different order-management schemes:
This is Type 1 debt, or Incompatible schemas. In the example above we have two different ways of dividing the "order status" world up. We need to find a way to map these statuses together. Now, we might take a superset of status subcategories, or a subset or partial list of statuses, or throw out anything that doesn't have a match, or declare a Year 0; those are all valid options. But we need to decide what we're going to do, because we want to make sure there are no old orders that can't be mapped into the new system. If we declare a year 0 and say there are no old orders in the new system, then we at least have to make sure our COA can support whatever financial reporting we might need to do. The Street never responds well to "we changed our ERP system so we can't do any year-on-year reporting until 2021."
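Here's a minimal sketch of that decision, with invented schemes: an equivalence map from O1's and O2's statuses into O3's, plus a check that surfaces whatever can't be mapped so someone has to choose a superset, a subset, or the bin.

```python
# Hypothetical status schemes for the two sources and the target.
O1_STATUSES = {"NEW", "BOOKED", "BILLED", "SHIPPED", "DELIVERED"}
O2_STATUSES = {"ENTERED", "CONFIRMED", "INVOICED", "IN_TRANSIT", "CLOSED", "ON_HOLD"}
O3_STATUSES = {"OPEN", "INVOICED", "FULFILLED"}

# The equivalence function: each source status mapped into O3's scheme.
STATUS_MAP = {
    "NEW": "OPEN", "BOOKED": "OPEN", "BILLED": "INVOICED",
    "SHIPPED": "FULFILLED", "DELIVERED": "FULFILLED",
    "ENTERED": "OPEN", "CONFIRMED": "OPEN", "INVOICED": "INVOICED",
    "IN_TRANSIT": "FULFILLED", "CLOSED": "FULFILLED",
    # ON_HOLD has no obvious home in O3: someone has to decide.
}

unmapped = (O1_STATUSES | O2_STATUSES) - set(STATUS_MAP)
bad_targets = set(STATUS_MAP.values()) - O3_STATUSES
print("No mapping:", unmapped)            # {'ON_HOLD'}
print("Mapped to nowhere:", bad_targets)  # set()
```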
Another example of Type 1 debt: O1 and O2 might have different customer models. This difference might be as simple as O1 allowing three address lines (e.g. ADDRESS1, ADDRESS2, ADDRESS3) and O2 only having two. If O3 allows three lines, then we've got an easy map as a target. If O3 only allows two, then we need to decide whether to merge the O1 fields ADDRESS2 and ADDRESS3 and what happens if there are cases where the merged length exceeds the allowable string length in O3. We also have to worry about reports that need an ADDRESS3, because that field won't exist in the new system O3 and those reports will break whenever they're run.
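A sketch of the address decision, assuming (hypothetically) that O3 caps an address line at 60 characters: merge O1's ADDRESS2 and ADDRESS3, and flag the records where the merge overflows so they get reviewed rather than silently truncated.

```python
MAX_LEN = 60  # assumed O3 column width; the real limit is whatever O3 enforces

def merge_address_lines(rec):
    # Collapse O1's ADDRESS2/ADDRESS3 into O3's single second line.
    merged = " ".join(p for p in (rec.get("ADDRESS2"), rec.get("ADDRESS3")) if p)
    return merged[:MAX_LEN], len(merged) > MAX_LEN

rec = {"ADDRESS1": "1 Main St",
       "ADDRESS2": "Building 7, North Campus",
       "ADDRESS3": "Attn: Receiving Dock 12, deliveries after 6am only"}
merged, overflow = merge_address_lines(rec)
if overflow:
    print("Truncated, needs review:", merged)
```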
A less trivial example - and from a migration standpoint quite irritating - is the practice in many jurisdictions of four or five levels of geographic specificity. In Canada and the US an individual's address consists of the street address, city, state/province, and country, with a zip code or postal code providing a geographic encapsulation sufficient to get to something like a proximate location. Other jurisdictions require strings for something like prefecture and/or region. In an ERP system with a North American bias there won't be space for prefecture or region - and although there might be one for county, from a legal standpoint "county" is not "prefecture." This might seem like a minor issue, but if sales tax or something similar is collected by local jurisdictions these fields become absolutely critical to the system.
So that's Type 1 debt: The fields or even tables are structured differently between two systems, and some kind of equivalence function needs to be built to say "these two fields are equal to this one field" or something similar.
Type 2 debt: Incompatible natural keys
Second, we can run into a scenario where the natural keys are different for the same entity. A "natural key" is a collection of attributes that can be used to identify individuals, to "dedupe" records in the technical sense. Logicians would call natural keys "identity conditions." (Note that the discussion of identity is, as you might expect, deeply philosophical. Luckily data management tightly restricts the scope of identity to records management.) An "identity condition" is the criterion under which we can differentiate one instance of a thing from another instance of the same thing. It's what makes you you and not your friend. It's your fingerprint, in one sense. For our purposes it's just the set of attributes we use to dedupe instances of entities, or records in a table.
Suppose for example that O1 decides that a customer is identified by their email address. That is, there is one and only one customer record for each email address. There's a lot to be said for defining a customer by their email address. Almost everyone has one, after all, and it's not as personal or dangerous as using the customer's SSN (in the US) or other nationally-managed personal key. It's generally both selected by a customer, so they can remember it, and a discrete unit of identification. But it does lead to some odd corner-cases. My father and step-mother shared their email address, something which is very common among Boomer couples of a certain age. In that scenario O1 wouldn't have been able to differentiate my father from my step-mother as customers. Now that might not be an issue at all in the ERP system. But if one person wants to get marketing emails and the other doesn't, which set of preferences on that email address gets priority?
Suppose O2 uses a different natural key to identify customers, say something like {name, address, city, state, postal code}. Consider everything between { and } just strings, so the natural key defining the customer is just a big long concatenated string. We have a different definition of identity and a new set of issues. Now it's possible to associate multiple email addresses to a single customer - a single long concatenated string - at the same geographical location. But if we have variant spellings of someone's name - e.g. if their legal name is Thomas Pynchon and their preferred name is Tom Pynchon or V - then it's possible we could end up with multiple customer records for the same person. This is because the strings we use to dedupe the records are different for each variant of the person's name. (E.g. even if the addresses are the same, "Thomas Pynchon" and "Tom Pynchon" are not the same strings, so the natural keys will be different.) This makes it easy to tell at the front door which mail is spam and which isn't, but it also leads to a classic master data management problem.
But we've got a serious Type 2 debt problem here. Suppose for simplicity's sake that O3 allows either (or any) natural key to identify a customer, so we just need to pick one instead of having to do the mapping into a third. So we pick one. Either way we have to map O1's email address keys into O2's {name & address} keys, and they won't match all the time. That is, there will be many records where a single email address in O1 maps to two {name & address} records in O2, like my father and step-mother, and vice-versa. In fact we can be almost certain this will happen for some non-trivial percentage of the cases.
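A sketch of the mismatch, with invented records: build both natural keys over the same people and count distinct customers each way. The counts won't agree, and that disagreement is the Type 2 debt you have to resolve.

```python
# Invented customer records: one shared email address, two people.
records = [
    {"name": "Thomas Pynchon", "email": "pynchons@example.com",
     "address": "1 Beach Rd", "city": "Malibu", "state": "CA", "postal": "90265"},
    {"name": "Mrs. Pynchon", "email": "pynchons@example.com",
     "address": "1 Beach Rd", "city": "Malibu", "state": "CA", "postal": "90265"},
]

def o1_key(r):
    # O1's identity condition: one customer per email address.
    return r["email"].lower()

def o2_key(r):
    # O2's identity condition: the concatenated name-and-address string.
    return "|".join(r[f].lower() for f in ("name", "address", "city", "state", "postal"))

print(len({o1_key(r) for r in records}), "customer(s) by O1's key")  # 1
print(len({o2_key(r) for r in records}), "customer(s) by O2's key")  # 2
```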
So that's Type 2 debt, or Incompatible natural keys.
Equivalence Failures
Those are the two kinds of Equivalence debt: Incompatible schemas, or Type 1, and Incompatible natural keys, or Type 2 debt. In both cases solutions to the problems I've described do not require rocket science. There are people with decades of experience solving them right now, all the time. But it costs money to solve these problems, and that's because someone has to build the machinery to resolve the incompatibility. I've also called these Equivalence failures because they are failures of the data model (i.e. database + content) to match the business model.
Type 3 debt: Underdetermined schemas
In the third type of debt, there are simply situations where O1 or O2 will store data that isn't stored in O3. The corollary to this problem - or the same problem, looked at a slightly different way - occurs when the business wants to store something in one of the Os but it can't. There's a Coverage failure: We've got stuff going on in the business that can't be recorded in any database, relational or not. And it's not that it "can't be recorded" because it's mysterious or ineffable. It can't be recorded because there's not enough time to do the software or database development.
Here's an example: Suppose there's a specialized customer-service group that's been pressed into service to record customer complaints of a specific type. Maybe there's a rash of "Better Business Bureau"-type complaints and the CEO wants to know about them right away so they can get ahead of them. So someone badass is appointed to lead the team and ensure that on a weekly basis the CEO and Legal get a report with a list of these complaints, broken down by product type and legal jurisdiction, and a monthly report showing whether the numbers are getting better. While this badass would like to use the corporate standard - e.g. Salesforce or Oracle or whatever - there's not time or money to get them access. So they start using Excel or Google Sheets to record complaints, and develop some charts and reports from that.
Over time a whole series of semantic assets - implicit database models and content schemes - is gradually built to support this team. In the worst case scenario there's just a series of spreadsheets undergirding the process, held together with anal-retentive discipline and fear of the badass. At some point the tools created for this team will need to be migrated into the mothership CRM system, in which case there will be a lot of Type 1 and Type 2 debt that needs to get sorted out. But for now, there's just Type 3 debt, Underdetermined schemas.
Outside of the enterprise software world, the problem of Type 3 debt is exploding. You will often hear VPs talk about "unstructured data" as if it's some kind of mysterious, nebulous ash that clogs up the file server, like the pulverized remains of star systems orbiting a black hole or the stuff that gets caught in the drain in the kitchen sink. But it's not. There is a basic fact about data management, what you might call the First Axiom of Data Management:
Axiom 1: There is a point to all datasets.
and the corollary:
Corollary 1: Data is not generated for fun.
When we look at the stuff that's collected on file servers it generally falls into two categories. First, there's collections of spreadsheets that are more-or-less waypoints between raw data and finished executive reporting. Finance teams are the worst offenders here, although in nearly every organization I've been a part of if the Finance team didn't do this no one would ever get paid, so they deserve a round of applause for their diligence and patience.
Second, there's sets of device exports and their associated analyses, generally stored as variations of text files but occasionally as binaries. This data is extremely common in health care, pharma, IoT, and device manufacturing. Each device, drug trial, medical trial, networked set of devices or whatever generates a stream of output. That's what it's for: You don't run a blood analysis machine because you like the blinking lights, and you don't collect heart-rate data during a run because you like wearing straps. Each of these semantic assets - a data model plus data - is a subset of what's actually collected, but there is still a data model inherent in the data. It's not unstructured. The implicit data model in the asset is of necessity a subset of what the device actually does. But what each of these things generates is pretty much just a report, intended for some downstream user. No one may ever look at it, but it's there. It's a shame the VPs refer to it as "unstructured" though, because someone went to a lot of time and effort to design that output.
The analyses that may sit alongside these sets of output are usually aggregations and error calculations designed to demonstrate whether the device is functioning within some predetermined set of parameters. These analyses may themselves be useful, but they're not typically of the form "We noticed the red line at about lunch and after we came back the next day..." They're quantitative assessments and they often mean something as well.
In organizations that develop lots of devices, there will be large amounts of development data containing device output plus any QA analyses done to ensure the devices are functioning correctly. If the org has a good data governance process then it will be comparatively easy to understand the data models of the output and the associated analyses. If the org does not, then each device will be sui generis, and clueless VPs will dismiss its data as "unstructured."
Sometimes of course we just have device data. My running watch (an Ambit3, if you're curious) and heart-rate monitor collect data and feed it into the Suunto system, and produce reports out of it. That means Suunto knows what the data means; it's not in the least unstructured. If I just saw the feed into Suunto I may not have any idea what it means, even if I'm both the user and a guy with 20 years experience in data management. In the case where I'm just looking at the files on a file server in an enterprise somewhere, I'm like the Suunto user who gets a glimpse of the feed from the watch into Suunto. In that case the data isn't even unstructured for me: There's clearly a structure to it.
In such cases, where we have a large quantity of data in Type 3 debt, we should think of the data as having a structure we don't know yet. Curt Monash has this very useful notion of poly-structured data, and that's what our multi-device file server is: It's a collection of poly-structured data.
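A small sketch of what poly-structured looks like in practice, with made-up feeds: the same file server holds outputs with different, but perfectly real, schemas, and once someone who knows the device tells you the delimiter and the columns, parsing is the easy part.

```python
import csv
import io

# Two hypothetical device exports. Neither is "unstructured"; they just
# carry different, undocumented schemas.
hr_feed = "timestamp,heart_rate_bpm\n2024-03-01T06:10:00,142\n"
blood_feed = "sample_id;analyte;value;units\nS-88;glucose;5.4;mmol/L\n"

# The documentation we wish existed: one declared schema per source.
SCHEMAS = {"hr": {"delimiter": ","}, "blood": {"delimiter": ";"}}

for name, raw in (("hr", hr_feed), ("blood", blood_feed)):
    rows = list(csv.DictReader(io.StringIO(raw), delimiter=SCHEMAS[name]["delimiter"]))
    print(name, rows[0])
```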
Even images are structured, obviously. You don't need to be a philosophical realist to see this: An image is the result of a causal chain connecting a situation to a representation of that situation via some mechanism. It can be a good representation or a bad one, and the mechanism more or less reliable. But it's not unstructured.
Truly unstructured data is white noise.
I know this seems pedantic, but it's not. There's a very critical mis-allocation of resources happening right now in data management: We've got a ton of really smart people trying to build algorithms and various non-linear functions to impute structure to the data on file systems because their VPs think that data is "unstructured."
It would be much much easier and faster if these really smart people just asked around, maybe with people familiar with the device or process that generated the data, and instead of trying to push the envelope on AI established some documentation practices.
It's analogous to watching TV with the sound off, and spending an enormous amount of energy trying to guess what they're saying instead of just turning the sound on. You ever been in a business scenario where people are freaking out about things they don't know that could be resolved with a phone call? That's the problem of unstructured data in a nutshell.
So that's Type 3 debt, or Underdetermined schemas.
Type 4 debt: Epistemic bubbles
Almost done. Do you need to get up and stretch? Take a walk. Pet your dog or say hello to a coworker.
The fourth type of debt, Epistemic bubbles, is the most nebulous to assess but the easiest to understand. I did a whole post on this just recently.
An epistemic bubble is a set of beliefs immune to contradiction: No matter what the world throws at the theory, there's always an excuse for why the theory can't be wrong. It was famously said that "communism could not fail, it could only be failed," and that's an example of an epistemic bubble. Many religious systems function as epistemic bubbles, and one could plausibly argue that's why religion is so popular and serves the function it does. A friend of mine used to refer to such theories as "self-sealing," as they're like self-sealing tires that, when punctured, have a goo that fills the hole and allows the tire to keep on rolling.
Because Type 4 debt appears to be a function of the model or scheme, it's easy to think it's somehow "bigger" than the other three types of debt. But it's not, and there's a very easy example.
Consider the familiar order-status content schemes in this graphic:
The Old Categories scheme consists of the states {NEW, BOOKED, BILLED, SHIPPED, DELIVERED}. What our content scheme claims is that all orders in our system will fall into one of these states. Now the order of the states - whether you can go from BOOKED to BILLED and then back to NEW - is a function of how we design our order management process. Sometimes orders might shift around, sometimes they might flow straight through. But those are all rules for transitions from one state to another, and nothing in the scheme dictates the order.
But how do we cancel an order? Cancellation happens all the time: I don't like the color, I don't have the money, I found a better price or it won't be delivered on time. I don't see a CANCEL state anywhere in our list of order-status states. If O1 uses the "Old Categories" scheme and O2 uses the "New Categories" scheme there's still no room for cancellation, because neither scheme has a CANCEL state. We can spend days arguing about mapping Old into New and come up with a perfect-fit mapping for O3 over hard-won negotiation. But O3 still can't cancel an order.
Well so then what? What happens when some irate customer calls customer service and wants to cancel an order? (Maybe that's why there's all those BBB complaints...) We simply don't have a vocabulary, a concept, a state, whatever, for cancellation. We can't do cancellation. We can't report on it, for example, because there's nothing to report: We don't have a state of being cancelled.
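A minimal sketch of the bubble, using the Old Categories statuses from the example: a system that only admits the states its content scheme knows about can't even record a cancellation, let alone report on one.

```python
# The negotiated scheme: a perfect mapping of Old into New, still no CANCELLED.
O3_STATUSES = {"NEW", "BOOKED", "BILLED", "SHIPPED", "DELIVERED"}

def set_order_status(order, status):
    # The system can only say what its content scheme lets it say.
    if status not in O3_STATUSES:
        raise ValueError(f"{status!r} is not a state this system can represent")
    order["order_status"] = status
    return order

order = {"order_id": 101, "order_status": "BOOKED"}
set_order_status(order, "SHIPPED")  # fine

try:
    set_order_status(order, "CANCELLED")
except ValueError as err:
    print(err)  # the epistemic bubble, in one line
```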
This is a very simple example of Type 4 debt, but it's easy to see how it can become more elaborate. Many systems paint their organizations into a corner. They know they need to track households, for example, but they can't do it without a complete redesign of their data model. Or they know they need to allow cancellations and returns, but because their order-status scheme is hard-coded they can't make the change without rewriting the software. Type 4 debt is about the gap between what your semantic assets can do and what they simply don't permit.
This kind of debt is difficult to assess. The koan I mentioned in my earlier post was:
We can only report on what we can do, and we can't do anything but what we report on.
That is, the business is tightly constrained to what its data models (i.e. database + content) describe. One simple heuristic for assessing Type 4 debt, which we'll get into in a future post, is organizational comparison: Ask people who've worked at organizations other than your current one what they used to be able to do in their jobs, but that they can't do now. And not just "can't do" because there's not enough time; "can't do" in the sense that no one in the org even thought they might want to do it.
I once worked at an organization that had only recently started email marketing. In its 20-year history, up until about year 18, no one had ever sent an email trying to sell something. So at year 18 they went out and hired a bunch of email marketing people, who came in to find not just a lack of infrastructure but no shared vocabulary to describe the problems they needed to solve. The Marketing people would tell the software engineers they needed to track metadata about the email, and the engineers might think that meant just "time and date," for example. The idea that "metadata" might mean various kinds of content (subject line, header, boilerplate) in addition to the time and date of send, plus some kind of approval process so that Marketing could do basic A/B testing, was just opaque: No one at the company who hadn't worked elsewhere knew what this system might look like.
Now the solution was not rocket science. It's either (a) buy a self-contained tool, of which there are many and varied, and/or (b) describe in excruciating detail the data elements required for a mature email-marketing program so someone can build it. But you couldn't justify (a) until (b) was done, and (b) requires some delicacy. No one wants to be treated like an idiot, but explaining (b) in sufficient detail to support a buy-vs.-build decision without also insulting your developers, Directors and VPs takes some experience and a lot of trust and good-will all-round.
The lack of experience in email marketing thus led to a lot of Type 4 debt. The organization was in an Epistemic bubble when it came to their email marketing processes.
Coverage Failures
Type 3 and Type 4 debt are examples of Coverage failures. These are situations where, in our ERP example, the data models (i.e. database + content) simply can't manage or describe the processes the business needs to describe. The data model is insufficient, either because the data that needs to be managed doesn't appear to have a schema (or the schema doesn't appear to be complete or completely known), or because the data models simply won't fit the data.
Why only four?
This is actually really straightforward. In practice, in data migration, we run through the process of trying to import data from source tables into target tables.
The first problem we run into is: Can we find a table or fields in the target to put the data from the source? This is where Type 1 debt pops up.
The second problem is: Can we dedupe at the target using the same natural keys we use at the source, so the same number of records show up in the target as show up in the source? We don't want to go from reporting we have 100 million customers before the migration to reporting we only have 10 million after, for example. This is where we start to see Type 2 debt.
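A sketch of that check, with a hypothetical natural key: compare distinct key counts on each side before declaring victory.

```python
def distinct_count(rows, key_fields):
    # Count distinct natural keys, i.e. deduped records.
    return len({tuple(r[f] for f in key_fields) for r in rows})

source_rows = [{"email": "a@example.com"}, {"email": "b@example.com"},
               {"email": "a@example.com"}]   # 2 distinct customers
target_rows = [{"email": "a@example.com"}]   # only 1 made it across

before = distinct_count(source_rows, ("email",))
after = distinct_count(target_rows, ("email",))
if after != before:
    print(f"Reconciliation failure: {before} customers in source, {after} in target")
```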
The third problem is: Is there data in the organization we hope to manage in the ERP system (or whatever) that's currently managed in some kind of home-grown Wild West-type system, like our badass's Excel spreadsheets? Is there a bunch of device data we need to analyze, and we're guessing or "applying ML techniques" to determine its format or schema? This is where we have to sort out how much Type 3 debt we've got.
And finally, we wonder what we can't manage because of our data models. We spend months migrating O1 & 2 into O3 and discover that we can't do cancellations. This is where we need to start cataloging Type 4 debt.
The End
The end.