Monday, July 30, 2018

Ad Hominems and Data Modeling: When to use the EAV




Data modeling patterns are like logical fallacies.  Almost everyone knows that ad hominem arguments are bad arguments: Attacking a person's argument because of seemingly arbitrary attributes that person holds is bad judgment.  If I say "Politician X makes the case that A, but since X has a really bad combover its clear A cannot be true" we've committed the ad hominem fallacy.  X's bad combover has no relevance to A, and because it has no relevance using X's combover to attack argument A is a logical fallacy.



Now, what if A is an argument about how to cut one's hair?  If A is something like "one should let one's hair grow really long on the top, dye it orange and sweep it backward so it looks long and luxuriant," then claiming that A is an argument made by X about how to do combovers, and A is a bad argument because X has a bad combover is a valid use of ad hominem.  In this particular case X's personal characteristics do have relevance to the argument X is making.

Fallacies have long been known to have this dual nature, at least since Aristotle: A logical fallacy is invalid precisely because its irrelevant to the argument at hand, but the pattern is valid when the argument pattern is relevant to the argument.

Data model patterns and software patterns in general follow the same path as logical fallacies.  (If the reader has not got a copy of Patterns of Enterprise Application Architecture by Martin Fowler they're strongly encouraged - strongly encouraged - to get one.)  There are appropriate and inappropriate uses of patterns: MVC or the "model view controller pattern" is a handy way to structure an application if you've got a team with a certain level of maturity and a project with a certain kind of complexity, but in many situations it simply is not a good idea to use MVC.  Like a logical fallacy, there are lots of cases where MVC is clearly inappropriate, and lots of cases where it is appropriate, and a gray-area where thoughtful developers and architects can argue a case one way or the other.

The EAV pattern (or "Entity Attribute Value" pattern) is a common data model pattern in structured database modeling and so pervasive you might even say its foundational to NoSQL systems.  (Good treatments of the pattern are here, here and here.)  Developers love EAVs, which is why its become the basic structure for virtually all NoSQL storage systems, in one form or another.  (Document modeling systems still generally structure their persistence layer as an EAV, but there are exceptions.)

When I first started really getting into data modeling I thought I'd discovered a universal truth in the EAV pattern.  I abstracted up as high as I could and ended up with Values, which is to say actual things, which are assigned to Attributes, which are the properties of Entities.  If you could dissect the world into Entities, Attributes and Values then you could describe the current state of everything.  So I sat down and built that model, and quickly realized I needed a bunch of additional fields for various datatypes.  I mean, you could store everything simply as an NVARCHAR(max), but you did need to indicate what the original datatype was so the application developer using the model's output could figure out how to display the data.  And you needed a bunch of metadata too, so you could track changes to values for a specific instance over time.  And you needed a notion of "instance of an entity."

Lucky for me I was just clever enough to save myself from any further ascent: At around the same time I'd discovered you could create tables by typing stuff into the system tables of the database, bypassing all the GUI data modeling machinery.  (I think I was using Sybase at the time.)  As I proceeded to type my fancy EAV model into the system tables in the database I discovered that I was, in fact, replicating the existing system tables - the database was already an EAV!  I'm not a fan of exclamation points, but this was a moment of mind-blowing through-the-looking-glass profundity for me: The Database was Already an EAV!   

And indeed it is.  In fact all persistence mechanisms that aren't just chunk-based storage like a file system are some form of EAV; even most file storage systems use something like an EAV model to indicate where an instance of a particular file type might sit.  That's because the EAV is the least semantically-loaded data model pattern there is.  It contains, in fact, no semantics.

And that gives us a clue as to when an EAV can be a valid data model pattern and when it can't.  Consider the classic EAV antipattern:

  1. Parminder designs a user-profile application and decides that she wants to use an EAV model - probably directly into a NoSQL system of some sort - to store data because the product management people haven't yet decided all the attributes they want to store.
  2. One of those attributes is DATE_OF_BIRTH, which is needed for compliance.
  3. Parminder moves on, but the db is still called "Parminder's system."
  4. Six months later Bob is told to use the user-profile db to help with email campaigns.  Marketing decides it would be good to send an email to users on their birthdays because who doesn't like getting "happy birthday" emails?  Especially older people.  And its not creepy at all.
  5. Bob looks in vain for BirthDate in the ATTRIBUTE table in Parminder's system and can't find it, so he adds a new record in the table.
  6. We now have two places to put date-of-birth.
This example is so common its probably happening somewhere right now, the second you read this sentence.  As semantic debt, it takes the form of "incompatible schemas" or Type 1 debt.  Which is to say its the simplest type of debt, because its just two different ways to divide the world up.

In this particular case we'd only see a bad user experience if we found some parts of the app using DATE_OF_BIRTH and some parts using BirthDate, and if the values got out of synch somehow.  In practical terms they're unlikely to get out of synch.  But that's purely through the luck of the draw: If instead of date-of-birth Parminder had needed to store ADDRESS_BILLING and Bob CC#Address we might have had a much bigger problem.   

But we can see where we went wrong with our use of EAV.  Parminder chose to use the EAV because she thought product management didn't yet have a firm semantics in place.  Perhaps it wasn't clear, for example, that DATE_OF_BIRTH would even be used in six months, in which case it didn't make sense to build a structured or relational set of tables to store the data.  We run into these cases all the time: We don't know which fields are important yet, but we do have a laundry list and we need to prototype a form for them.  Or we go through multiple round of UI or UEx development and we're more concerned with getting the input mechanism right than the storage piece.  Or we just aren't sure the project will even get funded, but we need a demo.

What should be clear is that the EAV is literally a repository of semantic debt.  There are no EAVs currently in existence that are not just buckets of semantic debt.  The pattern exists precisely to enable developers to take on semantic debt and then pay it down later, when the semantics needs to be solidified so we don't have a DATE_OF_BIRTH/BirthDate scenario.  Organizations that don't clean up this semantic debt will always end up with a problem.  Its a guarantee.  Whether that problem is significant to user experience or not is a function of various local conditions, but there's so many variables its functionally random chance whether it'll happen.  The only way to avoid the Parminder/Bob scenario and keep Parminder's EAV is to enforce rigid data governance rules to ensure Parminder's first-cut semantics are not violated.  And in that case, you're not so much in a scenario where you've got an EAV as you are in a scenario where Parminder's data model only uses three tables to store everything.

So its ok to use an EAV to store sketchy semantics, or in situations where you aren't sure quite what the model will be.  But any use of the EAV that doesn't also come with some kind of hard-core data governance plan is preliminary, and needs to surrender to a structured model in fairly short order.  If you're the only one using the EAV then you've got something like a hard-core governance process in place, but probably not hard-core enough: All it takes is one drunken programming session to completely forget what new attributes or entities you've created.

And you can reliably report against an EAV only until its implicit semantics changes.  Which is to say, you report against an EAV at significant risk if you don't have solid change management processes - and who among us has that?  If I were a BI manager I'd strongly oppose building any such reports because sooner or later something will get screwed up.

But you can't use an EAV in the long term without semantic confusion.  An EAV is a machine for generating semantic debt.  Useful in the short term, valid in some circumstances, but not in others.

   

No comments:

Post a Comment

The point of this blog

Welcome to the beginning of the film

In the twenty-plus years I've worked in data management I've spent a lot of time having some version of the following conversati...

Top 3 Posts