Codd's Essential Insight


Codd's insight will change your life.  Some people think Codd's work on relational algebra is simply about how to build relational databases.  As Moore's law has progressively erased time and space and everyone has made the shift to NoSQL, the feeling is that Codd's insight is obsolete.  But in much the same way that Gödel's Incompleteness Theorems are both about classical first-order logic and arithmetic, and also about so much more, Codd's insight and the way he arrived at it are critical to understanding how to do modern data management well.  This is true whether you use NoSQL systems and file storage exclusively, or you're buried deep inside the guts of a relational database all day long.


The relational database was invented by an English mathematician and computer scientist named E.F. "Ted" Codd.  He started work at IBM during the McCarthy era in the US, and moved to Canada for several years in protest of the Red Scare hearings.  Codd returned to complete his PhD at the University of Michigan in the early 60s.  His thesis was an extension of von Neumann's work on self-replicating digital organisms, also known as cellular automata, and he demonstrated that von Neumann's 29-state design could be reduced to just eight precisely defined states.  It was while employed as a database developer at IBM's research facility in San Jose during the 60s and 70s that he worked out a new way of managing data, eventually published in 1970 as "A Relational Model of Data for Large Shared Data Banks."
IBM preferred not to adopt Codd's suggestions, concerned that they would cut into revenue from the IMS/DB system.  In response Codd began possibly the first guerrilla database program: he demonstrated the value of his model to IBM's customers, who then began to pressure IBM to put the ideas into practice.  IBM mucked up the implementation, however, and it wasn't until a young software generalist named Larry Ellison heard about the model in the late 70s, recognized its value, and got some money from the CIA to build a database he called "Oracle" that Codd's thinking got a decent and workable expression.
Much of that history can be found on Wikipedia, and a deeper look is very much warranted.  But what is the relational model Codd proposed?
Before answering that question, two quick lessons about technology operations should be drawn from this short history.  First, at the same time that Ellison was putting Codd's thinking into practice for the CIA, Steve Jobs was implementing concepts Doug Engelbart had pioneered at SRI and Xerox PARC had refined, namely the "mouse," the "desktop" and the notion of the "personal computer."  Xerox did as poor a job with those concepts as IBM did with Codd's notion of relational data management, and both companies obviously missed enormous amounts of revenue because of their mishandling.  The thinking originated by Engelbart and Codd made modern computing infrastructure possible, including the current era of ubiquitous computing.  Had it not been for their insights into the practice of using computers to automate the access and storage of information, the computer-as-big-as-a-house period and the "databases are just text files" paradigm would never have ended.  Codd and Engelbart used their practical experience with computing workflow to propose better ways of doing things, and those things didn't just work, they blew everyone away.  In both cases the conservative establishment (the theorists) at their respective employers couldn't see the advantage, and either deferred implementation or botched it completely, all while the new operational paradigms dramatically changed the industry.  The lesson is this: when the "science" is new, the way computer science was in 1979 (and still is), you often get better theory out of the practitioners than you do out of the theorists.  This has historically been true in physics, astronomy and biology, for example, and Codd's experience at IBM and Engelbart's at SRI bear it out in computer science.  In practical terms, the intuitions of practitioners often turn out to be better pointers to big principles than the conservative judgments of the established theorists.
Second, note that it was the CIA that funded Oracle.  There are lots of inside jokes to be made here, in a lot of different areas, and the reader is encouraged to think some of them through.  But aside from the fact that it wasn't Codd who got the money, it was Ellison, it's always a good idea for data management practitioners to look at technologies the national security apparatus finds promising.  Whatever you might think of the ethics of mass surveillance in the US, China, Russia or the UK (for example), there are some very smart people in those agencies making de-facto recommendations about data management.  They have billions of dollars in budgets and very hard use cases, so those recommendations might come in handy for your users.  And while the ideas may or may not work for specific local use cases, they can be a handy early warning system for technological and political developments, and at the very least they make the data management practitioner look in the know in front of the boss.
So let's go back to Codd.  Codd's PhD thesis was an attempt to reduce the number of mechanisms needed to create cellular automata.  "Cellular automata" are models of workflows or processes built from as few and as simple mechanisms as possible; think of the "simple machines" used to describe the foundations of mechanical engineering, but abstracted into other domains.  The development of cellular automata is characterized by the identification and reduction of specific functions (such as replication, resource use, or resource acquisition) to the minimal number of building blocks required for a machine to perform a maximum diversity of activities.  In business terms: what is the smallest number of components you need to make a robot?  In mathematical terms, Codd's goal was to reduce processes to the smallest number of operations.  In more modern terms, Codd was looking for the most basic functions of computing, which could then be composed into more complex operations.
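To give a flavor of what "few and simple mechanisms" means here, below is a minimal sketch of a one-dimensional cellular automaton in Python.  This is Wolfram's elementary "rule 110," not Codd's own design (Codd's automaton was two-dimensional, with eight states per cell); it's only meant to illustrate how a tiny rule table can generate surprisingly rich behavior.

```python
# A minimal one-dimensional cellular automaton (Wolfram's rule 110).
# Illustration only: Codd's automaton was a two-dimensional,
# eight-state machine, far richer than this sketch.

RULE = 110  # the update rule, encoded as an 8-bit lookup table

def step(cells):
    """Compute the next generation; each cell looks at its two neighbors."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right  # 3 bits -> 0..7
        nxt.append((RULE >> index) & 1)              # look up the new state
    return nxt

cells = [0] * 64
cells[32] = 1  # start with a single live cell
for _ in range(24):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```

The point isn't the specific rule; it's that eight bits of "rule" plus one simple local mechanism is enough to produce endless structure, which is exactly the spirit of reduction Codd carried into data management.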
The notion of “reduction” will get some airtime later, but consider Codd’s mindset, based on his just-completed thesis, as he goes to work doing database development for IBM in 1965. 
At that point data was stored as hierarchies of records, where the records didn't need to be fixed-length or even fixed-schema.  Each hierarchy participated in another master hierarchy, and the chain of hierarchy instances was recorded and traversed in memory, using application code.  This metamodel is very similar to a modern document model like MongoDB's, at least in theory.  The logical data model implemented in such a system is a set of hierarchies.  In such a model a CUSTOMER is a hierarchy with a set number of level-one nodes, most of which contain data about a specific CUSTOMER (e.g. EYE_COLOR, RACE).  Some of those nodes may be pointers to a hierarchy a layer down, e.g. ORDER, where we'd store a set of order records; that hierarchy is then "attached" to the higher-level record.  So the hierarchy metamodel is implemented very much the way a modern document model might implement, say, a CUSTOMER collection: all of the CUSTOMER's data, including transactions and history, is in one self-contained object which can easily be manipulated in memory.
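To make that concrete, here's what such a CUSTOMER hierarchy might look like rendered as a modern document, sketched in Python.  The field and collection names are invented for illustration.

```python
# A hypothetical CUSTOMER "hierarchy of hierarchies" rendered as a
# modern document. All names here are invented for illustration.
customer = {
    "customer_id": "C-1001",
    "eye_color": "brown",                # level-one data nodes
    "address": "1600 Pennsylvania Avenue, Washington, DC",
    "orders": [                          # a child hierarchy "attached" here
        {
            "order_id": "O-5001",
            "ship_to": "1600 Pennsylvania Avenue, Washington, DC",
            "items": [{"sku": "SKU-42", "qty": 2}],  # another layer down
        },
    ],
}
```

Note that the address string already appears twice in this little record; hold that thought.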
For Codd it must have been a nightmare.  For application developers, of course, it's a dream: you need a field?  Just add it!  You don't have to specify a datatype, or worry about redundancy or duplication.  And if performance is horrible or data is corrupted you can (a) blame the database programmer and (b) throw more hardware at it, both of which are great ideas if your parent company sells hardware and professional services.
But obviously no one blames the model.  This is where Codd's instincts for reduction collide with his day-to-day practice.  He almost certainly spends a lot of his time resolving corruption caused by inconsistent writes in IMS/DB.  It becomes clear that while it might be mathematically possible for a disk head to traverse the required number of blocks to ensure that a badly-designed model doesn't leave, say, the EYE_COLOR values of a particular CUSTOMER out of synch in the many physical and logical places where those values might be stored, the approach is already impractical for the data volumes of the day.  The data model may simplify things for the application developer, who doesn't have to consider the consistency of his or her conceptual or object model, but it requires that DBAs continually escalate hardware requirements, and it almost certainly requires a significant investment in follow-on defensive programming to minimize the probability that corrupt data finds its way back to the surface.
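To see why, consider what a single EYE_COLOR change costs when copies of the value are scattered through nested records.  The traversal below is a toy sketch in Python, not IMS/DB's actual machinery; the record layout is invented for illustration.

```python
# Updating EYE_COLOR when copies live in many nested records: every
# copy must be found and rewritten, and a crash partway through leaves
# the surviving copies out of synch. A toy sketch, not IMS/DB.

def update_eye_color_everywhere(node, customer_id, new_value):
    """Recursively rewrite every copy of EYE_COLOR for one customer."""
    writes = 0
    if node.get("customer_id") == customer_id and "eye_color" in node:
        node["eye_color"] = new_value
        writes += 1
    for child in node.get("children", []):
        writes += update_eye_color_everywhere(child, customer_id, new_value)
    # If the process dies before the walk completes, some copies still
    # hold the old value, and nothing records which ones.
    return writes
```

Every extra write multiplies the opportunities for partial failure, and the defensive programming mentioned above exists to paper over exactly those partial failures.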


Codd's solution is to store only one copy of a datapoint, and to separate datapoints about the same thing from datapoints about different things.  CUSTOMER data should be stored in a separate structure from ORDER data, and links between the two datasets are established using a key that identifies a unique record in one set or the other.  The most obvious benefit of this approach is that writes are minimized: when you change the EYE_COLOR value for a CUSTOMER record you need only do it once, in the CUSTOMER structure, as opposed to having to find all of the various locations where EYE_COLOR values are stored in the hierarchies of hierarchies.  The old approach required maintaining transactional consistency across not just the string you want to store in all the various EYE_COLOR locations scattered across the hierarchies, but also the pointers you used to find those locations, both physical and logical.  ("Logical" because you need to know which hierarchies contain EYE_COLOR fields and values, "physical" because you need to know where on disk those fields are stored.)  Codd's approach requires just one lookup and one change: find the logical location of EYE_COLOR, which is stored in one and only one logical location, and then make the change in the associated physical location.
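Here is a minimal sketch of that arrangement, using SQLite from Python's standard library.  The table and column names are invented for illustration; this shows the shape of the idea, not Codd's or IBM's actual implementation.

```python
import sqlite3

# One copy of each datapoint: CUSTOMER facts in one structure, ORDER
# facts in another, linked by a key. All names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        eye_color   TEXT
    );
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id)
    );
""")
db.execute("INSERT INTO customer VALUES (1, 'brown')")
db.execute('INSERT INTO "order" VALUES (100, 1)')

# Changing EYE_COLOR is now one write in one logical location:
db.execute("UPDATE customer SET eye_color = 'green' WHERE customer_id = 1")

# Anything that needs EYE_COLOR alongside ORDER data joins on the key:
print(db.execute("""
    SELECT o.order_id, c.eye_color
    FROM "order" o JOIN customer c ON c.customer_id = o.customer_id
""").fetchone())  # (100, 'green')
```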

Codd's suggestion is simplicity itself.  For those naturally predisposed to laziness, and also skeptical of their record-keeping skills, the promise of only needing one location per EYE_COLOR suggests a means to handle much thornier problems of data and process management, the kind that are obscured by more tedious-but-immediate hardware and hierarchy-management issues.
One implication of the approach suggests a way to create these models: if the modeler strives to reduce redundancy because copies of records in multiple locations might get out of synch, then the model should structure those records to avoid redundancy in their storage structure.  This leads the modeler to the notion of "repeating groups" within the records themselves.  Here's how to identify repeating groups: for each set of CUSTOMER records there will be a set of fields that make up those records, and some of those fields will consist of discrete values (remember our table of attribute types, in section 2.3?) that recur multiple times throughout the CUSTOMER records.  This data is generally not a simple scalar or integer value, as in the case of, say, BIRTH_DATE, but data that might be considered a component of the master record, such as ADDRESS.  A specific address is associated with both a CUSTOMER record and an ORDER record, so we might be tempted to put ADDRESS strings in both structures.  ("Strings" is important here: we're talking about storing the actual address, e.g. "1600 Pennsylvania Avenue, Washington, DC", in both the CUSTOMER and the ORDER structures.)  But what if any of that changes?  For example, if the CUSTOMER decides to have the ORDER sent to another location, keeping the ADDRESS synchronized across the two structures requires two changes.  (Even more important, how do we track the use, or change of use, of a specific ADDRESS over time?  Marketing may not have been a use case when Codd was formulating his insight, but that's no longer the case.)  Codd's principle tells us that when we identify a repeating group we should create a new structure, called ADDRESS, assign a unique key to each unique ADDRESS record, and cross-reference that with the relevant CUSTOMER and ORDER records.  Then if the CUSTOMER wants to change their billing address but not the shipping address, they can.  CUSTOMER, ORDER and ADDRESS data is independently stored, and stable; what changes is the relationships between particular CUSTOMER, ORDER and ADDRESS records.  A place for everything and everything in its place.
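Continuing the illustrative SQLite sketch from earlier, here is one way the resulting shape might look.  The billing/shipping "role" column is an assumption on my part, one common way to model the relationship, not the only one.

```python
# ADDRESS becomes its own structure with its own key; CUSTOMER rows
# point at it rather than carrying copies of the string. Run after the
# earlier sketch; names, and the "role" column, are illustrative.
db.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        line       TEXT UNIQUE
    );
    CREATE TABLE customer_address (
        customer_id INTEGER REFERENCES customer(customer_id),
        address_id  INTEGER REFERENCES address(address_id),
        role        TEXT  -- e.g. 'billing' or 'shipping'
    );
""")
db.execute("INSERT INTO address VALUES (7, '1600 Pennsylvania Avenue, Washington, DC')")
db.executemany("INSERT INTO customer_address VALUES (?, ?, ?)",
               [(1, 7, "billing"), (1, 7, "shipping")])

# Changing the billing address is a change to a relationship, not an
# edit to every record that ever mentioned the string:
db.execute("INSERT INTO address VALUES (8, '10 Downing Street, London')")
db.execute("""
    UPDATE customer_address SET address_id = 8
    WHERE customer_id = 1 AND role = 'billing'
""")
```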
What's revolutionary about Codd's suggestion outside of data and business modeling proper, what makes it useful in database operations and the broader world, is the principle that you should avoid copying or replicating data to multiple places, because the copies might get out of synch.  It's a truism that corrupt data is useless.  Codd gives us a way to dramatically and structurally reduce that risk.
Codd's essential insight, then, is revolutionary in database management for two reasons.  First, there is the simple fact that it's dramatically easier to scale data management using a relational model.  In an ideal circumstance, where for example an optimal hierarchical model is used by an ideal application, where each insert or update is always managed through some kind of metamodel that exhaustively catalogs each location for EYE_COLOR, and each location for EYE_COLOR has identical datatypes and semantics, one might have a hope of managing a moderately complex application, given on-demand CPU and storage.  Complexity might increase dramatically, but as long as CPU and storage are provided by fairies, the application is in the clear.  But Joy's Law is about circumstances like this: no matter who you are, most of the smartest people work for someone else.  It is simply not possible to maintain the level of logical and semantic continuity between data management and software engineering teams that would permit the hierarchical model to succeed.  Indeed even now, with the return of the hierarchical model in the form of NoSQL document stores, development teams are discovering that loosely-knit collaborative teams much larger than two people produce data models that result in significant maintenance burdens.  An alternative, which many larger teams pursue, is to lock down change to the model and restrict updates or expansions to a select group of people (where that select group often consists of one person), completely obliterating the value of the flexible schema in the first place.  Ontogeny recapitulates phylogeny, even in operations.
Second, Codd encourages data management practitioners to be reductionist about their processes, to determine the smallest set of useful elements necessary to accomplish a goal.  A data management team could store data separately for customers, vendors, employees, lawyers, salespeople and statisticians, or it could view all of those groups as agents and manage the variations as needed; a data management team could provide software development teams with the random fields they need right now to store data from a form, or it could figure out the longer-term plan and provide a more coherent place for the developers to put form submissions.
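As a sketch of that first reduction, here is one common way to collapse those groups into a single structure, the so-called "party" pattern.  The names are illustrative, this is a fresh example unconnected to the earlier sketches, and it is hardly the only way to model the idea.

```python
import sqlite3

# One generalized PARTY structure instead of six near-identical ones;
# the variations hang off a role rather than duplicating the core
# record. All names here are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE party (
        party_id INTEGER PRIMARY KEY,
        name     TEXT
    );
    CREATE TABLE party_role (
        party_id INTEGER REFERENCES party(party_id),
        role     TEXT  -- 'customer', 'vendor', 'employee', 'lawyer', ...
    );
""")
db.execute("INSERT INTO party VALUES (1, 'Ada')")
# The same person can be both an employee and a customer without a
# second copy of her core record:
db.executemany("INSERT INTO party_role VALUES (?, ?)",
               [(1, "employee"), (1, "customer")])
```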
This is a very powerful heuristic.  Occam's Razor has done tremendous work in philosophy and science, and applying it to software engineering in general, and data management in particular, promises to add a lot of value there as well.  But Codd provides a motivation for this principle too, in the empirical observation that storing the same datapoint in multiple locations requires separate operations to keep that datapoint synchronized, and separate operations increase the risk of corruption.  It's not just that a single datapoint (e.g. EYE_COLOR) stored in multiple locations may experience a lag before all locations hold the same value, the lag we now call "eventual consistency."  It's that the separation increases the likelihood, over time, that the locations will become inconsistent and it will not be possible to know which location has the correct value.  The more locations there are for the datapoint, the more likely it is that one of them will become permanently out of synch with the rest, because the likelihood that one of the update operations will fail increases.
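A back-of-envelope model makes the point.  The per-write success rate and the independence assumption below are invented for illustration; real systems fail in messier, correlated ways, which generally makes matters worse.

```python
# If each write to each copy succeeds independently with probability p,
# divergence happens when an update reaches some copies but not others.
# With one copy that partial failure is impossible; with more copies it
# compounds over the datapoint's lifetime. Numbers are illustrative.
p = 0.999999   # assumed per-write success rate
k = 10_000     # assumed number of updates over the datapoint's lifetime
for n in (1, 2, 5, 10):  # locations holding the same datapoint
    partial = 1 - p**n - (1 - p)**n   # P(some copies written, some not)
    diverged = 1 - (1 - partial)**k   # P(at least one divergence in k updates)
    print(f"{n:>2} copies: P(divergence) = {diverged:.4f}")
```

With one location the probability is zero by construction; every additional location adds a risk that never goes away.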
This second implication of Codd's insight, that data management practitioners are smart to worry about managing a "single source of truth," ripples throughout modern data management practice.  It is the critical insight underlying all of modern data modeling.  It is the foundation of modern ETL architecture, and of system and data architecture in general.  And it's a very handy first question to ask when an organization is experiencing data management problems, whether they use standard relational databases to store their data or logging or NoSQL systems to collect the data their applications create: is the same datapoint being managed in more than one location?
If it is, then Codd’s insight applies.
