A short digression. Many people who are all "NoSQL f0r3v3r!!11!!" may look at this discussion of "semantic facts" as patently archaic. Why would you structure anything of value these days, especially when it slows you down and isn't agile?
In particular, I've been talking for the last few posts about building "facts." Facts as in fact table, or star schema, which is to say a Kimball-style data warehouse. And we all know what that means: a relational database, just like your dad used. Relational databases, as we all know, are the Ford F-150s of database design. We'd all much rather drive a shiny new Tesla Model 3 or some kind of Musk Truck than a Ford F-150. Why would we use a relational database when we could use the latest version of a document storage system designed by this guy I met at the Hotel Utah last weekend? For one thing, you can just append fields wherever you want, and for another, it's blazing fast on the 2GB of data I need to store!
Well sure, if that's what you want.
But there's an enormous misunderstanding here. I'm not talking about relational models with this work on semantic debt. I'm talking about patterns of persistence. How you choose to manage your data is critical to both its short-term utility and its long-term value. Codd's relational algebra is partly a way to manage writes and reads without messing up the queue. In those terms, much of what concerned Codd was purely a function of the mechanical engineering of his time: once people figured out how to do writes and reads to disk in ways that didn't create bottlenecks, relational distributions of data became less of a queuing concern.
(The page I link to there is a piece I've written for a separate project.)
But Codd's insight - what I call Codd's essential insight at the link - is much more about semantics than it is about bottlenecks. Codd did his PhD work on cellular automata, which we all know now as the building blocks of nanotechnology. So he was a reductionist by training. And his first job out of grad school was in database operations, managing the hierarchical data models that were everywhere at the time (and which are everywhere now). Now anyone who's actually had to manage a complex data set in a NoSQL system will have to admit that they're a pain in the ass: Early modeling assumptions can quickly turn into thousands of lines of code completely devoted to keeping things in synch across "entities." There's something of an arms race in modern NoSQL development, between making the model even more flexible and speeding up the ability of the model's users to keep everything in synch. But Codd had to do all of that for IBM, way back when. So he needed some heuristics that would help him model data in such a way that he wouldn't have to write thousands of lines of defensive code.
The answer was the relational model, but it's really just separation of concerns for data management. It's not relational, it's architectural.
Consider two data models you might use for analytics. (If you don't think you've got a "model" in your data because you never sat down and modeled it, then you've got a bad model anyway.) The first is the standard modern 1000+ field file that merges customer attributes with marketing attributes with some transaction data. What happens if you need to change customer attributes? Do you go line by agonizing line through each record and make the change? Well sure, you say; with modern tools you can do all sorts of things quickly that you couldn't in Codd's day. Tesla vs. F-150, after all. Except that you need to put a lot of effort into making sure those cases are well thought out, the code properly defensive, and so on.
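To make the cost concrete, here's a minimal sketch of that wide-file approach, with hypothetical field names; the point is that one changed customer attribute forces a pass over every row that mentions that customer.

```python
# Sketch of the wide-file model: every row repeats the customer's attributes
# alongside marketing and transaction fields (field names are made up).
wide_rows = [
    {"customer_id": "c1", "email": "a@example.com", "segment": "gold",
     "campaign": "spring_sale", "order_id": "o1", "amount": 19.99},
    {"customer_id": "c1", "email": "a@example.com", "segment": "gold",
     "campaign": "retarget", "order_id": "o2", "amount": 5.00},
    # ... millions more rows, ~1000 fields each ...
]

def update_customer_attribute(rows, customer_id, field, value):
    """Changing one customer attribute means rewriting every row
    that mentions that customer."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row[field] = value

update_customer_attribute(wide_rows, "c1", "segment", "platinum")
```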
Consider the second model, which compresses all the customer data into a file with unique records that can be linked to the main 1000+ field file by guids. Operations on that compressed file take so much less time simply because the file is smaller. If your main file is something like a list of all the org's interactions with a customer over the last five years, because you're doing attribution or LTV or recommendations, then you might have to update a single customer's attributes 100 times. And if there are millions of customers, that could all take a very long time.
So you normalize the customer data and put it into its own file, and instead of updating 100+ events × 10 million customers you update only 10 million customers.
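Here's the same idea as a sketch, again with hypothetical field names: customer attributes live once per guid, the event file only carries the guid, and you reassemble the wide view at read time.

```python
# Normalized version: customer attributes stored once per customer, keyed by
# a guid; the event file only references the guid.
customers = {
    "c1": {"email": "a@example.com", "segment": "gold"},
    "c2": {"email": "b@example.com", "segment": "silver"},
}

events = [
    {"customer_id": "c1", "campaign": "spring_sale", "order_id": "o1", "amount": 19.99},
    {"customer_id": "c1", "campaign": "retarget", "order_id": "o2", "amount": 5.00},
]

# One write per customer instead of one write per event per customer.
customers["c1"]["segment"] = "platinum"

# When you need the wide view again, join on the guid at read time.
joined = [{**event, **customers[event["customer_id"]]} for event in events]
```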
If you're smarter still, from an operations standpoint, you normalize that customer data even further, so you can update components without touching anything else. This makes your results more reproducible, and faster and easier to produce - you don't have to do a regression every time you update an attribute in your 100-million-row+ file.
Keep normalizing - there's a word your dad probably used - that gigantic file until it's really just down to the essential formula you're trying to analyze. This may be the straightforward ecommerce question: Who bought What from Whom, When did they buy it, and How much did they pay for it? Or it may be more complex. But if you remember back to your Introductory Logic course, each proper name in the formula is a variable, and each should get its own separate file.
Link each unique record in those dimension files to the central file that holds all your facts, and you've got a star schema.
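A minimal sketch of that shape, using made-up keys and dimension names: each proper name in the formula becomes its own dimension file, and the fact file holds only keys plus the measure.

```python
# Star-schema sketch for "Who bought What from Whom, When, for How much":
dim_customer = {1: {"name": "Alice"}}                            # Who
dim_product  = {10: {"sku": "SKU-42", "name": "Widget"}}         # What
dim_seller   = {100: {"name": "Acme Store"}}                     # Whom
dim_date     = {20240101: {"year": 2024, "month": 1, "day": 1}}  # When

fact_sales = [
    {"customer_key": 1, "product_key": 10, "seller_key": 100,
     "date_key": 20240101, "amount": 19.99},                     # How much
]

# Resolving a fact back into a readable sentence is just key lookups.
f = fact_sales[0]
print(dim_customer[f["customer_key"]]["name"], "bought",
      dim_product[f["product_key"]]["name"], "from",
      dim_seller[f["seller_key"]]["name"], "on",
      f["date_key"], "for", f["amount"])
```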
Using the star schema pattern to manage your data, even if you're on a NoSQL system or a file-based system like Hadoop, will make your life a lot easier. Codd's insights about the relational nature of data aren't a recommendation about which brand of database to use; they're about how to model your data so you don't have to work so hard to keep it useful.
This is exactly the operating principle behind columnar databases, if you think about it: Systematic normalization can achieve vast rates of compression.
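As a rough illustration - not how any particular engine implements it - dictionary-encoding a repetitive column is the same move as pulling a dimension out of a fact file: store each distinct value once and integer codes everywhere else.

```python
# A repetitive column, e.g. customer segment repeated across millions of rows.
segment_column = ["gold", "gold", "silver", "gold", "silver"] * 1_000_000

# Dictionary-encode it: keep the distinct values once, codes everywhere else.
values = sorted(set(segment_column))                 # ["gold", "silver"]
code_for = {value: i for i, value in enumerate(values)}
codes = [code_for[value] for value in segment_column]

print(len(values), "distinct values instead of", len(codes), "repeated strings")
```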
But the columnar db is really just one way to implement that model. You can implement it relationally, which is how everyone assumes it's usually done. You can do it in files, too, with the right conceptual infrastructure. Or in image elements.
So when I talk about creating "semantic facts" I'm not suggesting you have to do any of this in the latest flavor of SQL Server. You can implement semantic fact tracking in any medium you want.