Semantic Debt: 2.3 It takes time to get the data model right

More from the chapter entitled "Data Management is hard," explaining why people often ignore good data management practices in favor of the easy and expedient.

2.3 It takes time to get the data model right

There are people who find data modeling enjoyable and relaxing. While to all outward appearances the task may involve a lot of yelling and angry erasing of whiteboards, the author (for example) thinks that the process of doing data modeling is almost always exploratory, creative, bounded by the mysterious tyrannies of implementation and comprehension, and exhilarating. We’ve heard stories about data modelers in the old days who took years to reach their end state, and while that doesn’t sound like fun it would certainly be great to do what amounted to professional metaphysics all day.

But there are many people who don’t enjoy abstraction, or who may believe they enjoy it but aren’t suited for it by training or temperament. It may also be that work on a data model is artificially truncated by philistine business people or technical managers who think the perfect is the enemy of the good and want to get on with good business. In all those sadly-prevalent cases it’s often the case that what you might call “folk” data modeling becomes the norm. The notion of the “folk” data model is worth exploring for a minute, because it’s key to understanding this difficulty. One of the reasons data management is hard is because it takes time - and thought, and effort - to get the data model right.

Applying the term “folk” to a discipline to distinguish mature models from the unsophisticated - like “folk medicine” or “folk philosophy” - began in the 1970s with two Philosophers of Mind named Patricia and Paul Churchland. The Churchlands argued that much of what passed for theories of how the mind worked in the academic philosophy of the day was “folk psychology.” Folk psychology begins with the assumption that the words people use to talk about what’s happening in their minds somehow reflects the underlying nature of what’s actually happening in their brains. So people talked about their beliefs, desires and intentions - or at least academic philosophers did - and assume that there were components of their brains or their mental states that corresponded, no matter how diaphanous the connection, to those specific things. All psychologists needed to do, the thinking went, was find neurological states that matched up with a belief about e.g. ice cream, and then this whole question of what people were thinking when they thought of ice cream would be answered. To be more precise: Philosophers of Mind thought that when they said “I believe X” there was a thing in their brains that was a belief that X. These “belief” things were different from “intention” things, which were different from “desire” things, and beliefs that X things were different from beliefs that Y things. The “thing” didn’t have to be a single unified object; it could have been a specific collection of neurons. But the idea was you could point to that specific collection of neurons and say that’s what the belief that X was. If you removed that collection of neurons from their brain, then a person would no longer believe X.

The problem is that there are no actual neurological entities corresponding to a belief that “ice cream is delicious.” There is no unique configuration of neuronal potentials - nothing a machine could sort of snap a picture of - that occurs when someone believes that “Ice cream is delicious,” and another unique configuration that occurs when that same person believes “Ice cream is cold.” When pictures are taken of the brain of a person thinking those thoughts about ice cream, the two thoughts almost appear to happen in separate brains. So the Churchlands argued that the notions of belief, desires and intentions were just “folk psychology,” i.e. a handy conversational shorthand that had no scientific validity. Using “belief” as more than a conversational shorthand, to their thinking, puts too much weight on the notion. Moreover the folk use of “belief” didn’t have any actual scientific predictive value, because the stuff the human brain does has a completely non-linear relationship with its owner’s stated beliefs about ice cream. For the Churchlands, any scientific study of the brain and human behavior required actual science, and not attempts to discern the relationships between folk psychological notions like “belief,” “intention” and “desire.”

The idea of “folk” theories is a very useful one. Hand a child a bowling ball and a feather and ask them which will fall faster. If they haven’t taken any formal physics yet, they will almost always respond with the folk-physics intuition that the bowling ball will fall faster. More folk physics: The Sun revolves around the Earth. A prevalent folk-economics claim is that a nation’s budget must follow the same rules as a family’s budget, just on a larger scale. It was once a common misconception that insects could hatch out of thin air, for example, which is an example of folk biology. (Many of these folk-scientific notions are collected under the Wikipedia entry List of common misconceptions.)

The way the world actually behaves, in each case, is not just different, but very different than the folk model. In much the same way that the brain doesn’t contain “beliefs” except in a very limited, tenuous and not-scientifically-useful sense of “contains,” the behavior of all objects with respect to a central mass can be described equivalently for bowling balls and feathers, and the economic forces operating on, from and within a nation bear no relationship to those buffeting a family. In each case there may be a desire for the folk-level world to be different, and there may be conformant explanations at some degree of abstraction or exception for the observed behavior, but the folk explanation fails to be predictive or scalable.

Data modeling often proceeds along folk-theoretic lines. Recall the CUSTOMER.RACE example from the previous section. There is such a thing as a CUSTOMER, and one of the attributes of that thing is that it has a RACE. So how do you model that thing and its attributes? The folk data modeler will create a table called CUSTOMER, identify some kind of identity conditions such as a person’s national identifier (e.g. the SSN in the US, or the SIN in Canada) to govern insertion of new records into that table, and include a field called RACE in the table. This is all well within the skillset of a beginner data modeler. The application developers will dutifully insert records into the table following the modeler’s guidance, and all is well with the world. And then the magic day happens when, for whatever reason, the application suddenly needs to record two RACE values. It’s possible that there will initially be only two values, perhaps due to some agreement on scope or vision. The folk data modeler looks at the concept of CUSTOMER encapsulated in their table, and the field RACE, and their mind is blown. Who would have thought that one person could have more than one RACE value?

Well, it’s not that much of a surprise. Race is a social construct, and as such any modeling exercise around the idea of storing race for something as specific as CUSTOMER argues for a defensive posture. You don’t have to be multiracial or a literary theorist to understand that “race” is a social construct, you just have to look at the type of thing you’re modeling. Here’s a quick couple of thoughts on how a good modeler avoids getting trapped by folk conceptions.

Consider that the entity CUSTOMER has a lot of different kinds of attributes. First, there are natural-key-type attributes that determine who an individual is, such as SSN or SIN. The assignment or creation of these values are generally outsourced to a government entity or some declared external source of truth, and they’re often alphanumeric values like GUIDs or gigantic integers. The values of these attributes will never change, as they literally define a person.

Second, there are discrete secondary and/or transient attributes that virtually every CUSTOMER possesses but which can be expected to change completely and even often over time, such as ADDRESS, CONTACT_MECHANISM or even MARITAL_STATUS. Everybody moves or gets a new phone, so we need to be able to store these changes.

Third, there are attributes that every CUSTOMER has that are not identifiers, but that will never change and are discrete, such as DATE_OF_BIRTH, COUNTRY_OF_ORIGIN, BLOOD_TYPE or EYE_COLOR. You can’t identify a person with these attributes the way you can with natural keys, but they are still critical to many aspects of a person’s identity. Notice also that while two customers with the exact same address likely have a relationship, it’s highly unlikely that two CUSTOMER records with the same DATE_OF_BIRTH value are related.

Fourth, there are attributes every CUSTOMER has that are not identifiers, but that may change over time. They’re also often continuous: HEIGHT and WEIGHT are examples.

Finally, there are attributes a CUSTOMER may have that are assessed by outside agencies or over which the individual has no control, but which might be either continuous or discrete, and may not change much at all, such as CREDIT_SCORE or VISA_STATUS. The table below summarizes the options.

Attribute Type	Example
Attributes that identify the individual or are natural keys or primary, assigned, permanent, discrete	SSN, SIN
Primary, inherent, permanent, discrete	DATE_OF_BIRTH, COUNTRY_OF_ORIGIN, EYE_COLOR
Primary, inherent, transient, continuous	HEIGHT, WEIGHT
Secondary, inherent, transient, discrete	ADDRESS, CONTACT_MECHANISM, MARITAL_STATUS
Secondary, assigned, transient, discrete or continuous	CREDIT_SCORE, VISA_STATUS

Can the same CUSTOMER record have two EYE_COLOR values? Would we say that someone has both blue eyes and green eyes? David Bowie famously had two different colored eyes, leading us to answer, at first thought, “yes, a CUSTOMER can have two EYE_COLOR values.” But that’s not really what we’re asking. The question is, “could (for example) a CUSTOMER be reported as having one EYE_COLOR value by someone and then another EYE_COLOR value by another? And would we need to report on that?” The answer is: “well sure.” You might have to report on EYE_COLOR, and there might be more than one value. Perhaps the data is for a law enforcement agency, or precise gradations of color are required for security or authentication purposes, in which case there would be some value in allowing a CUSTOMER record to store more than one type of EYE_COLOR.

But the EYE_COLOR attribute is not the same as RACE, for a number of reasons. For one thing, no one is really discriminated against because of their EYE_COLOR, at least in the US. It’s the actuality of unjustified discrimination based on race in the US that led to a desire to record the race of individuals, both to determine whether organizations were discriminating and who might require remedies or defenses against such discrimination. Thus, absent regulatory or specific law-enforcement or security requirements, it’s not clear there’d be any need to store multiple values of EYE_COLOR. For another, because it’s not a social construct, conceptions of the values of EYE_COLOR won’t change over time. BLUE is still BLUE, fifty years on, while the same cannot be said for ASIAN, HISPANIC or especially WHITE.

But suppose EYE_COLOR is like RACE. The folk model of CUSTOMER has already failed for RACE, so maybe it also fails for EYE_COLOR. In both cases the only remedy apparent to the folk data modeler is to add more fields to the CUSTOMER record. Consider that if the CUSTOMER record includes an ADDRESS field or fields, and the CUSTOMER changes their address values, business logic may just dictate an update to the field values to the latest. But if the CUSTOMER claims two RACE values (or two EYE_COLOR values), what are the options? The standard folk-model approach is to add another field to the table to accommodate the new value. That means that there’s a CUSTOMER.RACE field, and somewhere on the far right side of the table, a CUSTOMER.RACE2 field and maybe also CUSTOMER.EYE_COLOR2. What about tri-racial people, or people with one green eye and another eye that’s goldish with flecks of grey? When that case arises, there’s call to add CUSTOMER.RACE3 and CUSTOMER.EYE_COLOR3, and work with the application developers to ensure there’s some kind of validation in place. At some point the table will run out of space or the customers will run out of potential combinations and this process will stop, but no one on the database team will know that it’s stopped. In the folk model of RACE, the data modeler simply appends an additional RACE attribute to the table to accommodate cases of people with more than one RACE value.

Or instead of using a folk model of CUSTOMER.RACE we could simply build another couple of tables, one to store RACE values, and one to cross-references the values in the RACE lookup table with the CUSTOMER record. (This RACE table may seem to have popped into our story as if from offstage, miraculously, but it’s almost certainly in the database to support the application’s drop-down box.) This would allow a CUSTOMER to pick one or more RACE values, as many as was necessary. A similar approach would work for EYE_COLOR, or any other field where we may need to store multiple values, but not necessarily, and where those values may change over time. In fact even most folk-modelers will approach the relationship between CONTACT_MECHANISM and CUSTOMER this way, especially when it comes to phone numbers, allowing that putting both WORK and HOME phone number values on the same record seems kind of redundant. Instead we could cross-reference an individual with a CONTACT_MECHANISM record containing the phone number, and then type the same CONTACT_MECHANISM record in multiple ways. (“Type” as in “the record might be of multiple types of things.”) But there are many examples of this kind of data. RACE is not like DATE_OF_BIRTH, but it’s also not like ADDRESS or CONTACT_MECHANISM. In a folk model there’s a reluctance to see the future uses of the attributes stored in the model. The emphasis is often on creating a place just to put data, instead of thinking about the things the data is about, the relationships between those things, and how those things and their relationships may change over time. The folk data modeler doesn’t think of our matrix above, they just think in terms of fields.

Notice something about our non-folk model: It appears to add complexity in the short-term in exchange for simplicity in the long-term. This feature of the correct, non-folk model will often be used against it. Also, there may be a perception that it’s hard to understand, or that there’s simply no need for multiple RACE values and moreover there never will be.

In reality, to respond to the first point, once the data management practitioner has some experience with the more appropriate model and written some access methods - whether procs or INSERT statements or whatever - the better model will become second nature and the organization will be spared the headache of figuring out why there’s a value in RACE3 but not in RACE1 or RACE2.

Each complaint can be trotted out to explain why a good data model is optional - that the correct data model is counter-intuitive and as it contradicts the folk model no one really knows what the right answer is, so there must not be a right answer, so any effort is wasted. The fix, of course, is easy, if there’s someone around with enough experience with data to apply the heuristic and avoid the folk model, even if they know nothing about the logical foundations of the heuristic.

It should be noted that the previous discussion related specifically to a model in a relational database. So-called NoSQL or schemaless or schema-on-demand databases allow the developer to implement folk models with seemingly no net penalty. They can add fields and modify cardinality to their heart’s content, secure in the knowledge that the database engine and their own code will mask any issues. In the case of the RACE example, that may be true: It’s simple enough to append a RACE-N field to a document model. But the NoSQL model begins to fail when an individual CUSTOMER record is also a VENDOR or EMPLOYEE and expects attributes entered through one UI to be shared across all of their representations. In such a case the folk model of CUSTOMER won’t work, and can’t be fixed as easily as we fixed RACE, either with ad-hoc additions of fields to CUSTOMER, or with our correct RACE-cross-reference table. That explanation will take more space than is available but the outlines of the solution, of a correct model of {CUSTOMER, VENDOR, EMPLOYEE}, should be closer to clarity for the reader.

Again, with a little experience this work can be done quickly upfront, using well-known patterns, preventing the organization from fielding complaints from customers who don’t understand why you’re using the wrong ADDRESS for them. These models have been in use for decades, which makes the excuse “data modeling is hard” really pretty weak. That doesn’t make it any less common.

One reason data management is hard, then, is because the prevalence of folk data models means it takes time to get the data model right.

Semantic Debt

Monday, February 12, 2024

2.3 It takes time to get the data model right

2.3 It takes time to get the data model right

No comments:

Post a Comment

The point of this blog

Welcome to the beginning of the film

Top 3 Posts

Report Abuse