Monday, May 4, 2026

The Marketing Analytics Apocalypse

 

The Hertzsprung-Russell diagram of stellar evolution

When I started this blog in 2018 I was unemployed, an immigrant in the US during Trump's first term, a data architect who didn't have any cloud experience and who had come to doubt it was even possible to do data architecture successfully as an FTE. My family was doing their own thing, and all I could do was wait out the world and keep myself busy.

I decided to work on two problems while I had all that time. One was the data model-integration problem that everyone who builds a data warehouse or just glues together extracts in a GSheet faces: This weird, seemingly non-linear estimation problem, where even if you know both sources well you can't necessarily be confident you've got the meaning right. And this is non-linear for all sorts of reasons: Maybe the extract lost some records in transit by accident, or the formatting of the report was slightly off when you downloaded it, or someone had been using a custom field to represent the status of that business process for ten years but then they moved on, and now the field isn't maintained. The other problem was to stay busy, outside and occupied. That was largely resolved by bowhunting for deer, turkey, rabbit and bear, in the Coast range just miles from the Pacific in Mendocino county, the Sierra north of Tahoe/south of Lassen/east of Sutter Butte and Gold Country, and in the Diablo range west of Coalinga, and the searing hot marshlands of the Central Valley around Los Banos.

I had zero luck that year on animals, so in that sense the "hunting" part was a complete failure. Before I was unemployed I'd bought a membership to a hunting club that worked with forestry companies and landowners to let members hunt on their land, because the hunting on public land in California was supposedly terrible. But I put an arrow in nothing that year but targets, which was still pretty fun, and did a lot of bowhunting in the utter quiet (minus the occasional loggers or cattle) of the huge trees. Bowhunting is a volume proposition, so rather than go with the hiking boots my dad would have used (or cowboy boots, because he'd rather ride), I decided to just go trail running. Trail running with a bow is quite a lot of fun, obviously a very ancient way to profitably spend your time, and had there been more animals - rabbits, turkeys, deer, bears - there might also have been more meat to take home.

I had a lot of luck, on the other hand, with Semantic Debt. No one at the time read it, and very few have since, although I'll address the evolution of the concept that has occurred since 2019 in subsequent posts. But the work on this blog helped reset my confidence in my craft and my judgement. The model I developed to explain how semantic debt manifested in software engineering was a coherent theory. When I used the model to make predictions it did what a good model should and told me things I didn't know that turned out to be true! Or at least that matched my experience. And it could explain much of the mismatch between strategic dreams and tactical nightmares that happened leading projects as a data architect.

A Data Architect's role at an organization is to drive regularization in data collection and use across the organization, given and within the constraints of the organization - technical, organizational, cultural, strategic: the list of possible kinds of constraint goes on. Much of the difficulty in the job comes from balancing those constraints well enough that you can set a target state. But that balancing work is the 80% easy part of the problem. The 20% hard comes with the actual work of integration, which we're taught to think comes to an end but practically speaking can never come to an end, because when change stops the system has stopped responding to its evolving use-cases. But without the kind of detailed forensic examination you'd need to actually do the production integration, there's no way to know how much work is actually involved in that 20% hard. The more customizable the source systems - the Salesforce or Netsuite or home-grown ecommerce or inventory management system - the more the business process depends on quirky, often time-dependent "business logic" that no one but the critical few understands or has access to. Like a Renaissance farce entitled The Meaning of the Fields, the Data Architect and their people, should there be any, chase these insiders around and desperately try to find ways to implement their secret recipes as known rules. Imagine doing that for the same people for 20 years. Imagine being hired at year 15 of that, told not to relitigate 15-year-old arguments, but still expected to document contentious business meaning.

Now multiply this 20% hard across a dozen internal systems, with overlapping responsibility for managing data about one or another of the things in the organization's workflows. Inventory is handled in three systems, customer data in five, and some of those overlap. Why do they overlap? Well, that's usually a series of reasons you could start to catalog, and it became fairly easy to abstract out the "structure" (as the data modelers say) of those reasons while I was rambling quietly down a fire road in 105F heat. Much of the taxonomy of semantic debt came to me pretty quickly: just a straightforward matrix of mutually exclusive categories that pointed to perhaps more ways data might be transformed, but a starting four were pretty clearly sufficient.

Conway's Law always struck me as a really solid heuristic when figuring out how a company's ecosystem is organized. I'd implicitly applied it multiple times in my career at that point, in 2018. But while the software architecture diagrams show the boundaries of logical systems, and those generally match the funding organization and its culture, the data produced by those systems is also strongly influenced by the software people who design the collection system(s). You could have an ecosystem with lots of little apps that did all this data collection and/or transfer for people, where some apps have databases and some don't, and they more or less communicate with each other consistently. Or you could have a single app - say, the Order Management module of a big-iron ERP system - that manages everything for a single function, or as much as was possible to negotiate between people and engineering. The software engineers are a major influence on what data is collected and how, for sure, with their own theories and horizons and best practices. The user-experience designers too, who start with the original "folk" workflows of the org buying or building the app, usually well-worn spreadsheet paths that connect dozens of dots across teams in exchange for massive cognitive overhead. UEx maps workflows into a visually semantic rule framework that abstracts them: into user paths that preserve truth, ideally into technology to record permanently where necessary and/or sufficient, and also ideally into a culture program to ensure everyone is happier and more productive and efficient. And so UEx people also filter and influence what data gets recorded. With these two additional cultural dimensions added to Conway's Law you get even more value from the heuristic.

The more astute software engineers and user-experience designers were even in the mid-00s looking at the way they did software and wondering "where does all the data go?" But by the time I was writing Semantic Debt, the general perspective was that "data modeling" was a thing of the past, that progressive web apps that stored little to no state made it easy to build software, and that even much user-experience design was an unwelcome friction in the mind-meld between product manager and Agile software team. When teams go through the exercise of "automate process" they do what the product manager says to do. With the progressive web-app paradigm, there simply isn't a need for history. The imposition of an expectation of logical consistency on a development team's choices from one release to the next means you have to introduce "logical consistency" as an NFR on a dev team's list of architectural constraints. Of all the NFRs, which might as well stand for Not For this Release, logical consistency was very low down the list. It's much harder to define than the Security NFR, or the Reliability NFR.

What that frictionless development process means for Data Architecture, practically speaking, is that if you can't develop a personal, friendly relationship with every one of those dev teams, everyone on your "data team" is directly downstream from a chaos generator and completely blind. Even if you are friendly with the relevant people, a social connection often perceived as sort of creepy by people who've never met a Data Architect before, all you can really do is call down the changes to the people downstream, giving people at the bottom a little more time to possibly find a place for the new Tetris shape as it floats through.

If you can't solve those problems - regularize the data flow, using technical standards but mostly personal charisma - then if you're reasonably good at your job you know what will happen. Data will be wrong, a critical report will be wrong, a critical exec will be embarrassed, a critical customer or supplier will be offended. The source of the bad data can be identified, pretty reliably, if you can traverse the hierarchy with any faith. A remedy can be negotiated, if it's perceived as necessary by the PM and SE teams involved. But there are more non-linear variables: The organizational hierarchy is murky-to-secret, tangled, often hostile. The Architect must be ready and willing to put on a smile and a positive face with all involved in the search and the fix, no matter the tenure of the problem. If charisma is lacking that week then maybe the whole thing collapses. More often than not developers and PMs would never even entertain a call with the Data Architect in the Analytics group. When that person makes life difficult over multiple sprint cycles, they simply stop coming to meetings because they're too busy.

And that's just one group. Conway's Law suggests you can draw a semantic boundary around all the people who use the apps those Product teams build for internal teams, whether that build scales across tens of thousands at Salesforce or Adobe or dozens at product startup ABC. The combined culture influences how those apps interact and what functions they automate and which they don't and thus the data reflected in whichever persistence layer.

With all those non-linear variables in play, I just didn't see how you could do that Data Architect job. Eventually, and for me it happened within the first few months at the last few companies I worked at before unemployment, your entire day consists of avoiding the phrase "well as I said would happen, here we are" or any other diplomatic or emotionally neutral version of "I told you so." The Data Architect is that guy telling you that if you don't work out you're going to have trouble later, and the entire org has been told that it's ok to eat more Doritos and drink more scotch right now because someday there will be a solution that doesn't involve doing what that guy says has to happen.

I also had no Cloud experience. Now of course in 2026 I have tons; I don't think I've looked at an on-premise database in five years, and that was maybe one of four in 7.5 years. Everything I work with now is in the Cloud, or one or another of them. But in 2018, I was an immigrant Data Architect in Trump's America with no Cloud experience, when the progressive web app movement was reaching something of an apotheosis in its prophecy that the rejection of logical consistency would have no long-term consequences. (Has that ever even stopped?) I had lots of time that year and very few interviews, all the way up to mid-December when a friend connected me to the managing partner of the company that hired me by January 2 the next year. My thinking about semantic debt thus had a constraint that proved extremely helpful - much of the chatter in database forums is about the technical difficulties of connecting systems together, the technical difficulty of encapsulating complex business logic within a single ETL package. I was averse to the phrase "data pipeline" because I knew nothing about how they were structured. I had to force myself to abstract out all that I thought I knew so no one could catch me out on a feature of the Semantic Debt model that would only attach to on-prem relational old-school databases and point-to-point ETL packages. And that forced me to think at the field level - surely there were still "fields" in cloud databases, I thought - about how the modeler built up a picture of the organization from its apps.

Many of the benefits of the Cloud paradigm fixed issues I knew we had with on-prem systems, in particular the preference for DevOps and source control. When I discovered my job didn't depend on negotiating for tuning privileges, I was actually ecstatic. (Much as I myself had done little with source control and developed tuning expertise only where needed, I thought both were a good idea and automation would make life great.) But many of the problems with the on-prem paradigm also feature in Cloud environments. The war between the ETL and ELT developers still proceeds with no gap or let-up in hostilities. Either a shop is one or the other, and while the ETL shop is wrong, all an ELT-aware Data Architect can do when they're stuck is say "Well as I said would happen, here we are."   

In 2019 I started work as a consultant for a modern boutique "Big Data" consultancy, although by then the concept of big data had started to lose some edge. But these guys genuinely were "big data," always handling the cases no one else wanted, whether because they had more data than anyone knew what to do with or more of a mess than anyone could figure out. They took them on confidently, professionally, and with success most of the time. They agreed with me on the need to treat Data Warehouse rollouts as product management problems and not technical problems, and on which parts of the timeline were riskiest. One of the managing partners asked me what I thought of MongoDB and I gave a polite answer, and he asked me why anyone would ever give one to any developer. He thought there was no use case, because of the semantic debt that introduced into the ecosystem. He didn't use the term semantic debt, and I don't know if any of the partners ever did read any of my stuff. (Too wordy. Although now they can get AI-generated summaries.) But whatever; the concept is hard to miss just from the name alone, and we recognized we'd seen similar problems and had to develop similar solutions. And at that point my career was largely a bunch of 1- to 2-year gigs trying to do data warehouses in a variety of environments, succeeding at some priorities and failing at others, functionally working as a Consultant even if I was emotionally an FTE.

Wow I made a lot of mistakes. But let's not talk about those. Initially some of the work was very specific to Marketing data, which I'd worked with once before, at Adobe. The goal on many projects was to find a way to connect the Marketing operations efforts (or whatever) back through to the rest of the enterprise in a continuous improvement loop, from planning to action to measurement to planning. To do that we needed to be able to build logical foundations fast that we could monetize later, a general proof architecture. This may not have been the stated goal, but I came to believe over the course of the last 7.5 years that it is possible to provide everyone within a Conway boundary with more-or-less good data that is, most importantly, logically consistent across all use points. But that's because I began to see data modeling through the various lenses required to understand use points throughout the enterprise. Salesforce data modeling is very different from Adobe data modeling is very different from Meta and Google Ads and GA4 and Kochava and the list goes on and on. The modeling is different for all the reasons I mentioned above - the coupling of "people who put data into the apps hiring people to build apps for them to put data into" and "people who build apps quickly" is a non-linear, 4-dimensional relationship. Some elements you could quantify precisely, to the millionth of a penny, and others would take a team of Anthropologists and Psychotherapists and batteries of tests and interviews.

So the modeling isn't going to be unified at the source. Not only is it not going to happen now, it never will. Marketing is the first example of the exponential explosion of semantic debt. We're going to see the same thing happen with Agentic AI, an acceleration of both the debt and the need to pay it off. The Marketing industry value proposition is entirely dependent on every data model being fundamentally different from everyone else's - "My view/platform/experience/media is different from all others, and as a result I have my own model for the engagement" - and you must pay me for access to ad space until someone else replicates or one-ups me. It's weaponized McLuhan, weaponized sociology, a logician's nightmare. My awareness of how many platforms there are in the Marketing ecosystem - and it's more than there are human-eye visible stars in the sky, all hemispheres at once, which by the way is ~4500 - has significantly increased just in the last couple of years. What I saw in 2012 looked barely doable - technically possible, sure, but not politically or culturally likely. What I saw in 2020 scared me. In 2021 I figured out an abstraction that would let me bypass the necessarily organic modeling that might otherwise take place in a Marketing project, where you'd begin with e.g. Facebook, design an abstraction, and then try to integrate Google Ads or GA4, or an email service provider or a CDP. Each randomly chosen next-step in the integration process meant optimizing for a brand new set of fields, each with their own logical rules, many of them so impenetrable they may as well be encrypted. An entirely new ontology, an entirely new data model, an entirely new and unique combination of SE-managed enums, lookup tables, changeable JSON schemas.
In short, each step can't but result in its own misshapen semantic blob, and unless there's some kind of actual Grand Unified Theory that lets you categorize records from different platforms into some high-level entities that you can map into some pre-determined semantic structure, you're always optimized for the last three sources and no more.
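To make the shape of the problem concrete, here's a minimal Python sketch of what "categorizing records from different platforms into some high-level entities" looks like in practice. Everything here - the platform names, field names, units - is invented for illustration; real platforms each have their own enums, nesting, and quirks, which is exactly the point.

```python
# Hypothetical sketch: normalizing records from two imaginary ad platforms
# into one shared high-level "engagement" entity. Each platform needs its
# own hand-written adapter, because nothing about the schemas lines up.

def normalize_platform_a(record: dict) -> dict:
    # Platform A reports spend in micro-dollars and calls impressions "imps".
    return {
        "platform": "platform_a",
        "campaign": record["campaign_name"],
        "impressions": record["imps"],
        "spend_usd": record["spend_micros"] / 1_000_000,
    }

def normalize_platform_b(record: dict) -> dict:
    # Platform B nests its metrics and reports spend in cents.
    return {
        "platform": "platform_b",
        "campaign": record["meta"]["campaign"],
        "impressions": record["metrics"]["impression_count"],
        "spend_usd": record["metrics"]["spend_cents"] / 100,
    }

rows = [
    normalize_platform_a(
        {"campaign_name": "spring", "imps": 1000, "spend_micros": 2_500_000}
    ),
    normalize_platform_b(
        {"meta": {"campaign": "spring"},
         "metrics": {"impression_count": 800, "spend_cents": 150}}
    ),
]
```

Two adapters are easy; the trouble is that each new source means a new adapter optimized against whatever the last few sources happened to look like.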

The other key piece of my first-pass attempt at handling the problem is a way to handle the mapping. A semantic equation between the value of some field from some source and the value of some other field in some other source is the end state of all this work, and at its most basic the whole data model is an actual physical hierarchy mapping two or more semantic hierarchies into each other, and we should just store it like that. My theory is that I can eventually force someone to find a way to make the hierarchy management more efficient, finally after 29 years of doing this work. But ultimately the requirement is to version these hierarchies and keep their instances back a certain period. So I bit the bullet. Bought the bullet? A Data Analyst will do that hierarchy management work in a CSV that becomes a dbt seed, when they need to do a rapid Marketing data model. You could build up a library of stable semantic hierarchies across platforms, and customize those as needed, or build a brand new one. But all of the 4-dimensional mapping involved in the new data model is necessarily flattened into a semantic hierarchy of two dimensions.
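As a rough sketch of what that flattened, versioned hierarchy looks like - with all platform, field, and value names invented for illustration - here is the kind of CSV an analyst might maintain as a dbt seed, loaded and indexed in Python:

```python
import csv
import io

# Hypothetical flattened semantic hierarchy: each row equates a
# (platform, field, value) triple with a canonical field/value pair,
# carrying a version so older instances can be kept around.
SEED = """platform,source_field,source_value,canonical_field,canonical_value,version
platform_a,stat,LIVE,status,active,1
platform_b,state,running,status,active,1
platform_a,stat,DONE,status,completed,1
"""

def load_mapping(seed_text: str, version: int = 1) -> dict:
    """Index the seed by (platform, field, value) for lookup at a given version."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(seed_text)):
        if int(row["version"]) == version:
            key = (row["platform"], row["source_field"], row["source_value"])
            mapping[key] = (row["canonical_field"], row["canonical_value"])
    return mapping

mapping = load_mapping(SEED)
# Two platforms' quirky status values resolve to the same canonical pair.
assert mapping[("platform_a", "stat", "LIVE")] == ("status", "active")
```

The version column is the whole trick: when a platform changes its enum, you add rows at version 2 rather than overwrite, and the old instances stay queryable.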

That focused work - simplifying my understanding of Data Architecture so it will work across more use cases, with less customization, as a plug-and-play way to "inflate" a Semantic Layer within a plug-and-play pipeline management system - plus the exposure to the depth of the problem in Marketing: well, I don't feel more equipped to deal with the semantic debt.

AI is not the answer, btw, and while an LLM will have some use my prediction is for no more than low double digit increases in modeler productivity. You'll get more out of routine scans with a sophisticated data catalog across the whole ecosystem, unfettered access to all the data plus the Product teams, NotebookLM for documentation, intensive interviews for a few weeks with key players - you get the idea. What's missing is context, and the LLM can offer domain-specific hypotheses but not abductive inference. 

There are a lot of techniques we might try to survey the space created by all these signal collectors, sort them into categories and synthesize the signals into a coherent, provable set of facts. I'm not sure if we'll even get to try them before Digital Marketing is overwhelmed by the noise of semantic debt. The marketing analytics apocalypse is near, the day when it becomes practically impossible to get a tunable, usable signal because of all the semantic debt. 

On another note:

I won't be gone for long. There are some more thoughts I'd like to get down.


 
