
4.3 How and why to build an Operational Data Store

Nobody knows how to build an ODS, or why you might build one. I've had many, many arguments over the years with "data warehouse developers" who assume that the Kimball-style Analytics warehouse is the only kind of warehouse facility there is. This mistake results in bad systems design, because the ODS solves a specific set of use-cases that a Kimball-style system simply can't.

In this chapter, which comes after a discussion of Data Lakes and source systems, I explain how and why to build an ODS.

4.3 How and why to build an Operational Data Store

In this section we’ll first discuss why the Operational Data Store or ODS, also known as an Inmon-style warehouse, is often negatively compared with the Kimball-style warehouse. Then we’ll talk about the use-cases satisfied by the ODS, and the steps for constructing one.

The Operational Data Store is the obvious next evolutionary step in data management systems development, after an organization has explored what a Data Lake can do. The ODS is also the most misunderstood system in the evolutionary process, from a development standpoint, and experienced ODS developers are rare. The ODS has fallen out of favor in recent years, in part because it appears to require a more skilled data modeler than, say, Star Schema-based data warehouses or "Kimball warehouses," or the Data Lake. An ODS also appears to fall short in cost-benefit comparisons with classic Kimball warehouses.

This last point is worth exploring in some depth, because the reluctance to build an ODS in favor of an Analytics or Kimball warehouse has been the cause of a lot of bad systems development and contributes a lot to the negative reputation of data warehouse projects. The ODS is critical to the CQRS pattern beloved of application system designers (as discussed in the Patterns of Persistence section) and can play a clutch role in an organization’s data management ecosystem. While it may be a blank spot on the map for most practitioners, the ODS is both extremely useful and a surprisingly simple system to put together.

As anyone who's sat through an interview with professional data warehouse people knows, there are perceived to be two grand theories of "back-end systems design." Each is held up as the alternative centerpiece of a reporting, business intelligence or analytics strategy. Both theories are nicknamed for their progenitors.

The Kimball approach is named for Ralph Kimball, and uses a star schema (and the derivative snowflake extension) as its central data model paradigm. Many people like the Kimball approach because they believe it allows an organization to get to the analytical point quicker. The star schema is an intuitively obvious solution to analytical problems, and it provides enough of a data modeling challenge - but not too much - to justify a developer's time. The star schema makes it easy to do retrospective analysis of the states of a thing, and even when retrospective analysis isn't actually the point of the exercise, that doesn't stop the partisans.

Historically the "Inmon approach," broadly speaking, was perceived as delaying the creation of the Kimball warehouse or the Analytics system, in favor of first building a series of abstractions and management facilities to clean and organize data. Bill Inmon initially referred to this approach as the Operational Data Store or ODS. He later superseded the concept of the ODS with that of the Corporate Information Factory (CIF), to indicate there's a larger set of systems here than just an ODS. Inmon expanded his concept because many people missed the point of the ODS, both by minimizing its value for operational data and by maximizing the prep work required to get it right. While the CIF is a critical expansion of Inmon's overarching program, we'll continue to refer to the data model that satisfies the use-case we have in mind as the ODS. Put simply, the CIF is a set of facilities that should be common to everyone's data warehouse ecosystem, and not just the ODS, and we think the conflation of the ODS with the CIF misses the point of both. None of that means people want to wait to do the right thing, however, and whether it's the CIF or the ODS, the perception is that Inmon is forcing more work into the plan than is necessary.

People dislike Inmon's approach because what they want to do is simply query data in an easy-to-use format, which is perceived as "analytics," without having to build the large quality-management infrastructure Inmon seems to be forcing on them. Inmon, in short, appeared to be making people eat their vegetables or do their laps even though there was no apparent need. Data management practitioners who've dealt with an executive management team unhappy over an overdue "data warehouse" project understand the attraction of choosing the seemingly short-and-simple Kimball warehouse timeline over the seemingly more arduous Inmon plan. When things are done the Kimball way, it's reasonable to expect an Analytics system in three months. The Kimball solution may only take a quarter-year because, generally and accidentally, there's little integration cost for its source data, or because the solution usually ignores the quality-management frameworks and precision expected of an ODS or CIF. With the Inmon approach, in contrast, it appears there's a need to patiently explain to management that the team is building things carefully for the long term. And as we all know, there's nothing busy and anxious executives like more than the patient expenditure of money on the carefully-planned long term.

So Inmon’s ODS system design has fallen out of favor, because it’s been sold as harder to do than the Kimball warehouse. Inmon himself is perceived as intransigent about the need for long-term thinking. The ODS has a cranky-old-DBA feel to it, the kind of thing we all know we need to do but don’t have time for. And besides, Kimball partisans will claim, the Kimball warehouse can logically do the same thing as the ODS. Put in these terms, it’s difficult to make the case for the ODS. 

But these terms miss the point. First, the "O" in ODS stands for "operational," which in common usage means "current," as in the business's current operations.

The ODS is focused on answering questions about real orders that are currently undergoing changes, or real customers with real questions about their existing states, or current inventory for live products in existing warehouses. The Analytics or Kimball warehouse is designed to track state changes in a transaction over time. Each core record in a Kimball system is a fact about the state of a thing, with a wide series of attributes associated with that core fact that can be used as filters to find relevant facts. Each Analytical model is designed to aggregate data across billions of those facts quickly. That's a fascinating use case, the aggregation of the numeric values of billions of facts. But it's not the operational use case, and being logically equivalent doesn't make the Kimball model physically preferable.

There’s often a specific business need to quickly understand what’s happening to a particular transaction or process right now. Not all of this use-case is sponsored by our friends in customer service; applications may need this data too, to populate a CQRS pattern for example, to supply status pages or make real-time decisions about how to govern working processes.  We also want to know, in aggregate, precisely how many orders or products are in a specific state, for example how many orders are currently in “backorder,” so we can create task lists for our people. 

In a Kimball warehouse each of these facts about an order's current state will be mixed in with a whole lot of other facts that are not current and thus not relevant to the use-case. To answer the question, what is the current state of order X?, a Kimball warehouse needs to search through all the billions of records not of interest to find the right one. Moreover, because of the scale of the data loaded into a Kimball warehouse, latencies on the load will often be such that even the latest data we're looking through is old, maybe even older than last week. When a customer wants to know their most recent status as of this morning, day- or week-old data is not sufficient. The Kimball warehouse is organized in such a way that we often need to denormalize the fact table multiple times to get a complete picture of the order, stringing the fact table together to try to reconstruct the order process. And the database brands used to persist Kimball warehouses excel at summing up numeric values through billions of records, not at stringing single facts together into one big, wide record in seconds. They are not transactional, they are analytical.

Further because of the size of the set in the Analytical warehouse, there will be quality tradeoffs. A long enough fact table will miss some statuses, and literally make up the difference in the aggregate trend. The ODS requires us to be precise about the current status and deduplication of the things we’re reporting on, because it’s our friends in customer service who are on the phone with that customer talking about this specific order. We intend to maintain the same transactional precision as our source system in the ODS, and imprecision and latency are not our friends. This is the diligence perceived to be the hard part of Inmon’s approach: The ODS requires carefully deduped and mastered entities, because we’re reporting on the state of individual transactions. The Analytical system, while still perhaps a stickler for quality, is intended to support trend reporting and even at its most stringent a certain allowance must be made for slippage.

The ODS doesn't have to be difficult to construct. If an organization has a well-used Data Lake they're likely already halfway there. An organization will often hire in Kimball-partisan warehouse architects and developers and discover the resulting warehouse is nowhere near as useful as they thought it would be, often because the developers and architects slavishly followed a pattern without any real sense of the use-cases that pattern is ideal for. This lack of one-size-fits-all utility to Kimball warehouses contributes to "data warehouse failure," not so much because the construction of the Analytical warehouse failed to capture the state changes of the source data, but because what was needed was an ODS. It may be that some of Inmon's intransigence is a reaction to a deep misunderstanding of the architecture required to construct an ODS, and of the degree of difficulty in doing so. In our experience the cranky-old-DBA mood arises as a reaction to Kimball-oriented people who believe anything that can't be fixed with their star schema-shaped hammer isn't fixable.

Let’s switch our focus squarely to the ODS. 

Consider, for example, the following use case. A customer at a large multinational online retailer wants to know the current status of their order. The customer has placed hundreds of orders with the retailer, going back decades in several countries, and each of these orders is spread across physical tables in several source systems, each holding data about various components of the orders. All of those orders have been closed except the one the customer is calling about right this moment. Because our retailer has a same-day delivery guarantee, personnel need the data to be as current as possible, and they need it right now. How does the data management practitioner help these people out?

Logically, we answer the question about the status of an open order with: 

  • A model that co-locates all of the data about the current active order in as few physical tables as possible.

  • A model that emphasizes low latency, for querying and update. 

There's no logical reason, given these requirements, that the model shouldn't consolidate the various components of the order into one wide table, where each table record represents e.g. an item (or a "detail" record on the order), and the order's relevant "header" data is replicated on each line. Orders are orders, after all, and have a pretty generic structure the world over. There may be local variations on the order concept which can be acknowledged, celebrated and modeled into our reporting system, but as a starting point we've got a pretty good idea of what the denormalized order table should look like. And as this order's items are each represented as one record, we have a pretty simple way to update the various statuses we might need to represent on this order. The financial statuses such as credit-card clearance, the inventory picking, the shipment preparation and final waybill assignment, all the way to confirmation of delivery from the carrier, can each be represented in its own field in the same wide table, making these statuses easy to represent and query. This structure also makes it easier to update each status field from various sources in a standard relational database, locking only the relevant row and permitting many rows to be updated at once without locking the entire table. (Table-level locking is very common in the analytical databases designed for Kimball warehouses, by the way; it's useful when loading large amounts of data, but it would make an ODS that trickles in small lots of new orders hourly virtually useless.) As the individual source systems responsible for the various steps in the logistics of customer satisfaction eventually update this record, we can publish some expected degree of latency to the customer-service rep looking at the record, providing some confidence that the status as reported is as close to the current status as our pipeline will allow. As an added bonus, we can copy the facts created by each status change into a separate staging area, for use in an Analytics system.
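
To make the shape concrete, here is a minimal sketch of what such a wide order-item table might look like; every table and field name here is invented for illustration, and the types and tracked statuses would follow local conventions.

-- A sketch of a wide, denormalized order-item table for an ODS.
-- Names are hypothetical: one row per order item, header data repeated
-- on each row, and one column per status we want to track and update in place.
CREATE TABLE ODS_ORDER (
    ods_order_item_id         BIGINT        NOT NULL,   -- artificial primary key
    source_header_id          BIGINT        NOT NULL,   -- natural key, header part
    source_line_id            BIGINT        NOT NULL,   -- natural key, line part
    order_no                  VARCHAR(30),              -- customer-facing order number
    customer_no               VARCHAR(30),
    sku                       VARCHAR(30),
    ordered_qty               NUMERIC(12,2),
    unit_price                NUMERIC(12,2),
    credit_clearance_status   VARCHAR(20),              -- one field per tracked status
    picking_status            VARCHAR(20),
    shipment_prep_status      VARCHAR(20),
    waybill_no                VARCHAR(40),
    delivery_confirm_status   VARCHAR(20),
    source_last_update_date   TIMESTAMP,                -- latency metadata for the CSR
    ods_last_update_date      TIMESTAMP,
    CONSTRAINT pk_ods_order PRIMARY KEY (ods_order_item_id)
);

Because each item is a single row, updating the picking status on one line touches only that row, which is exactly the row-level locking behavior described above.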

A quick side note on our concept of order and ODS entities in general. When a customer or customer service rep considers an order, they don't think of just the contents of the SALES_ORDER table in the ecommerce engine or the central transaction database, they think of the entire business process. Similarly when Marketing thinks of the customer they aren't just thinking of the contents of the CUSTOMER table, they're thinking about the customer's first purchases and their last, their current geographic and cyberspace locations and orientations, and a variety of other attributes that are likely spread across a variety of tables, some of which might be Analytical in source. The ODS is designed to enable these wide concepts to be easily queried and updated. Below we'll talk regularly of "orders" or "customers," but we should be understood to be referring to the widest possible use of those terms. Not just the SALES_ORDER, then, but also the SHIPMENT, the DELIVERY, the RETURN, and various other components that make up the order process. The breadth of the concept we're trying to represent as a single entity here is ultimately dictated by the end-user. If we're supporting CSRs, we want to represent their concept of "order." If customers are our end-user, then their concept of "order" is the operational one. And so on.

Because our wide, many-attribute ODS_ORDER table only stores active orders and items - and it’s probably best to include all the items on an order until they’ve all been delivered - we can afford a little redundancy. We’ll thus default to including many fields that may apply to only some components of the order, or some lines. If there’s no data, the field is simply left NULL. We’ll create each order-item record as soon as we can, and update each status on that record as soon as we can, but we can do that via batch updates far away from the source systems so we don’t interfere with their collection processes. With those architectural principles in mind, the steps to creating an ORDER table in an ODS are roughly as follows:

  1. Go find an order-management data model on the Internet, one that includes shipments, deliveries and so on.

  2. Denormalize it as liberally as possible by joining each leaf node to its parent node. 

  3. Add in any oddball fields you may have in your own order management systems.

  4. Calculate the number of active orders in those source systems and provision database space accordingly.

  5. Build an upsert operation against that new table set from each source using whatever best practices are associated with your data engineering tool.

  6. Index the table for deduplication and a couple of common search strings. 

  7. Provision some mechanism for deleting CLOSED orders out of the ODS table past a certain age, e.g. two weeks.

You’ve just built the core of an ODS_ORDER model, usable by customer service, finance, website developers and report developers alike. 

This process will be developed in more detail below, but this is an ODS in action: the consolidation of operational data, or data which reflects the current state of the abstracted transactions of the various source systems. Some source systems manage shipments, others deliveries, and others the core of the order itself. The ODS_ORDER table created above to satisfy this use case can contain a denormalization of all the data stored across the various source systems. But because it just reflects the current status of the order, cardinality issues are minimized. We don't have to create a new record for every status change, because we store each status change inline, as a field. Tables can be denormalized with confidence, combining order lines and headers, shipments and deliveries in the easiest way possible, concerned only with the easiest update and fastest output path.

Contrast this supposedly more complex, less versatile, and more resource-intensive approach with the use of a classic Kimball warehouse to solve the same problem. A Kimball warehouse would represent the bare bones of the order across each status change, from booked to delivered, as a set of facts, which isn't particularly helpful when what I want is the current value of a whole set of statuses at once. To the consumer, the order instance is a large, co-located collection of facts about the order itself, the shipment, the delivery, and so on, any one of which processes might be relevant. The ODS model wants to represent that entire order, which as noted above is spread across multiple records in (likely) multiple systems. The Kimball model wants to represent the order process as multiple facts.

A couple of alternate avenues suggest themselves to the ingenious Kimball warehouse developer. One is to take the design approach argued for later in this book, which is to represent each status change on the order as a record in a fact table: each status change is itself a fact about the order, which is a pretty uncontroversial suggestion. In a standard Kimball design process, then, an order with at most ten statuses in an ERP system would be represented as at most ten fact records. As details of the order process are discovered in other systems, and their statuses teased out, each of those status changes represents a new fact about the order, and a new row in the fact table. But that would mean, as mentioned above, that to find the latest of the relevant statuses for a customer service rep, the entire fact table must be queried multiple times: first to find the maximum or otherwise relevant dates on each of the relevant fact records, second to return to the fact table and find the statuses associated with those dates, third to discard the irrelevant ones, and finally fourth to return the relevant result to the user. For every single relevant fact there are as many as nine others that must be discarded on the way to the truth, which requires a minimum of four queries where our ODS requires only one. Alternatively the fact table could be partitioned so only the most recent facts are easy to get to, and this is theoretically easy, because the fact table has been structured to make such partitioning queries fast. But the relevant question at 50k feet is, why look through a bunch of data that isn't needed - isn't needed because it simply isn't relevant - to answer the CSR's question?
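
To make the contrast concrete, the following sketch, with invented table and column names, shows roughly what the "latest status" lookup tends to look like against a status-change fact table, next to the equivalent lookup against a wide ODS table.

-- Sketch only; names are hypothetical and the fact-table query is condensed.
-- Against a status-change fact table: find the latest fact per status type
-- for one order, then pull the status values associated with those dates.
SELECT f.status_type, f.status_value
FROM   order_status_fact f
JOIN  (SELECT status_type, MAX(status_date) AS max_status_date
       FROM   order_status_fact
       WHERE  order_key = 12345
       GROUP  BY status_type) latest
  ON   f.status_type = latest.status_type
 AND   f.status_date = latest.max_status_date
WHERE  f.order_key = 12345;

-- Against the ODS: the current statuses are already columns on one row.
SELECT credit_clearance_status, picking_status, shipment_prep_status,
       delivery_confirm_status
FROM   ods_order
WHERE  source_header_id = 12345;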

Moreover there are competing priorities: The operational data report developer wants a lush, fat record, replete with possibilities, so they only need to go to the database once to get everything they need to satisfy their customers. Analytical users, on the other hand, want a spare fact with heavily normalized dimensions, so they can crunch through decades worth of facts about order processes without having to incur the overhead entailed by sifting a bunch of stuff irrelevant to their current question.

But the Kimball practitioner is not out of clever options. If the heuristic described above is rejected - that one useful definition of a "fact" about orders is a change in status on the order or its components - then they may move to a less logical notion of how the fact table should be defined. In our experience that tends to fall into what might be called "the kitchen-sink fact." In its more elegant interpretations - and these sorts of things are never really elegant - the order fact record looks like the source system records, with additional fields appended as they're discovered. So the fact record ends up as a denormalization of all the source fields with relevant order data. Instead of stacking relevant status changes in a fact table for later analysis, fields are added to the right of the table's primary key, one for each interesting status. For efficiency's sake a record may be created for each line or item on the order, with header-level statuses duplicated on each, and fields updated from NULL to the relevant value when a value becomes available. Analytical users have a lot more hoops to jump through to get their data now, but the operational users are happy. And no wonder: The fact table now looks exactly like what would be found in an ODS. The Kimball designer hasn't built a fact table, they've built a denormalized order table showing the current values for each of the relevant fields on the order.

However, the Kimball warehouse designer insists on building “dimensions” to go with their “fact” table, and this leads to problems as well. In general a dimension is intended to provide a series of filters for an analytical user, to make the definition of relevant analytical subsets easy. While many of those filter values will also be of use to the operational user - e.g. age and gender can be useful before picking up the phone to talk to an irate customer - others will be of no use at all, such as race or marital status. Further the specific customer-related fields that are often necessary for operational users, such as mailing and billing addresses, are perceived as ungainly and extraneous to Kimball warehouse designers, and generally get added to the dimension tables only after serious (and legitimate) concerns about the elegance of the Kimball model have been overcome.

Building a table in an ODS, then, is a matter of putting together a record that assembles the current or latest data in source transaction systems, across as wide a business process as possible. It’s best to denormalize, and not use a star schema. Building an ODS is often as simple as undoing the normalization exercises the transaction system designers engaged in. Allowances need to be made for cardinality and audience, obviously: If end users need for example all delivery records but just the current status of an order, a single combined record for order data plus the current status for delivery will be sufficient, with a separate table showing each change in delivery status. But the simplicity involved in building an ODS makes it difficult to see why so few organizations take advantage of them. 

Once the organization comes to realize how simple and useful an ODS is, two paths can be taken in construction. The first, and by far the fastest, is what we'll call the "Nominalist" approach. The Nominalist approach assumes that all and only the attributes in the source systems are required in the combined operational table, and that adding any additional fields to the table is purely speculative and might moreover lead to variable or ambiguous mappings from source systems that will cause problems later. The Nominalist ODS doesn't, in practice, require much data modeling or abstraction. In a Nominalist ODS tables are assembled simply by constructing supersets of source table queries. In fact the simplest way to do this for orders, deliveries and shipments is as follows, using the CREATE TABLE AS construct:


CREATE TABLE DEV_ODS_ORDER AS
SELECT oh.*
     , ol.*
     , shp.*
     , d.*
FROM order_header oh
INNER JOIN order_line ol
        ON oh.header_id = ol.header_id
LEFT OUTER JOIN shipment shp
        ON ol.line_id = shp.line_id
LEFT OUTER JOIN delivery d
        ON shp.shipment_id = d.shipment_id


The table created here, named DEV_ODS_ORDER because it's designed to be a temporary development asset, will have all of the fields in the ORDER_HEADER, ORDER_LINE, SHIPMENT and DELIVERY tables. Because of the LEFT joins, it'll also have all of the records from the source tables, whether there are delivery records for an order or not.

Many of those fields might be NULL by default and thus empty, and many field names will be duplicated across those tables, particularly the metadata fields such as creation and update dates, and foreign keys. So the script above can't just be cut-and-pasted into a query tool and executed. When something similar is executed in a local context, the data management practitioner will need to identify the duplicate fields and either choose one as representative and remove the redundant ones, or rename the fields to reflect their semantic content. In general, when choosing between duplicate date fields, the field from the relevant leaf node should be picked. Under the renaming strategy, four fields called LAST_UPDATE_DATE become HEADER_LAST_UPDATE_DATE, LINE_LAST_UPDATE_DATE, SHIPMENT_LAST_UPDATE_DATE and DELIVERY_LAST_UPDATE_DATE. There's some utility to keeping all of these fields, and since storage is cheaper than it's ever been, why not?
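
In practice that means replacing the SELECT * shorthand with an explicit column list; a hypothetical fragment of what the regularized script might look like follows.

-- Fragment of the explicit column list that replaces the SELECT * shorthand.
-- Duplicate metadata fields are renamed to carry their origin; names are illustrative.
SELECT oh.header_id,
       oh.order_no,
       oh.last_update_date  AS header_last_update_date,
       ol.line_id,
       ol.quantity,
       ol.last_update_date  AS line_last_update_date,
       shp.shipment_id,
       shp.last_update_date AS shipment_last_update_date,
       d.delivery_id,
       d.last_update_date   AS delivery_last_update_date
FROM order_header oh
INNER JOIN order_line ol ON oh.header_id = ol.header_id
LEFT OUTER JOIN shipment shp ON ol.line_id = shp.line_id
LEFT OUTER JOIN delivery d ON shp.shipment_id = d.shipment_id;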

The script for the DEV_ODS_ORDER table, after field names are regularized, an artificial primary key is added, and some useful metadata fields - such as the load and update dates for the record, and possibly the source system and the like - have been included, becomes the basis of a production ODS_ORDER table in an ODS. Local circumstances may dictate splitting this table up. Perhaps the way source systems structure the entities, or the way end-users need this data, argues for splitting the denormalization script in two, with one script joining line, header and shipment records and a second script handling just delivery records. Some kind of delivery status field may still be desirable in the ODS_ORDER table, but there are users who need each delivery's status change. In that case, all of the delivery records are operational data to that audience, and a table should be constructed to meet their needs.

The same DDL used to create the production table can be used to create the staging table, with additional fields to manage foreign key updates to the stage table, if necessary. (For example, if there's a CUSTOMER table in the ODS that consolidates customer-related data from across the organization's various systems, then that ODS_CUSTOMER table's primary key should be propagated to the relevant ODS_ORDER records. Keep the original CUSTOMER identifier, but include the ODS_CUSTOMER foreign key to make joins between ODS_ORDER and ODS_CUSTOMER more efficient.) If the team's data engineering tools don't require a staging table, then as long as they've identified natural keys - which in this case are easily identified as the simple concatenation of the source tables' primary keys - there's something to hang an upsert on.
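
As a hedged illustration of that foreign-key propagation - assuming an ODS_CUSTOMER table that keeps the originating system and source customer identifier on each row, with invented names throughout - the stage-side update might look like this.

-- Sketch: carry the consolidated ODS_CUSTOMER key onto staged order records.
-- Names are hypothetical, and UPDATE ... FROM syntax varies by platform.
UPDATE stg_ods_order s
SET    ods_customer_id = c.ods_customer_id
FROM   ods_customer c
WHERE  c.source_system      = s.source_system
AND    c.source_customer_id = s.customer_id;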

The alternative to the Nominalist approach to ODS design is the Realist approach. A Realist will start with a theory about what an entity like ODS_ORDER should look like, and test that theory against fields from source systems. The Realist may even go bigger than entities like order, and propose a conceptual model of how the business works, translated into tables or subject areas. 


Let's take the simple example first. Consider that every order has a quantity, a price, an order date and a completion date. There's likely also some reference to the thing being sold, whether it's a product or a service. An order will have a date by which that product or service is expected to be available to the customer, and it certainly also has a CUSTOMER, and a location for that CUSTOMER. This collection of observations makes up the Realist theory, or model, of the order, and the various representations of orders in source systems can be mapped directly into that theory, while the addition of fields that supplement the theory of the order is considered along the way. Some source systems will contribute additional attributes to the model or theory, and those can be added to the order model as long as they conform to the model or the use cases it's intended to meet.

Realism is a particularly handy approach when multiple source systems have their own order data models, with roughly the same attributes. There will be quantity fields in each source system, for example, some represented as QTY, others as QUANTITY, and so on, and they can be mapped into the Realist ODS_ORDER model. The Realist doesn’t worry about redundant fields in their source systems, such as multiple LAST_UPDATE_DATE fields in their source’s header/detail/items tables. The Realist simply doesn’t care: If their conceptual model of an order requires tracking changes to header, detail and item records individually, then they’ll include those fields sui generis in the table(s) that implement their model. 

The Realist finds a lot of possible redundancy where the Nominalist is content to let well enough alone. Orders in their source systems will have statuses, payment terms, ship-tos and bill-tos, pricing schemas, and a whole host of possible attributes and attribute sets, all of which could be created as fields in the ODS_ORDER table, if you're a Nominalist, or which need to be abstracted, if you're a Realist. The Realist thus faces the possibility of reconciling different "payment terms" conditions, for example. One system may allow 0-15, 16-30, 31-45 and 46+ day terms, while another permits 0-10, 11-30, 31-60 and 61+. The Realist will of necessity have to resolve inconsistent statuses. Some organizations call "booking" the act of creating the order in their accounting system as an obligation prior to billing, while some reserve "booking" for a state subsequent to billing, when the order is booked as revenue. This is the task the Realist takes on, however, with relish: Arriving at some kind of overarching theory about how the organization manages orders will require forcibly converting data. The Realist may choose to pick one status scheme over the other, or they may choose to create a larger all-encompassing scheme that can accommodate both without introducing any weird provisos. But the task of the Realist is to create what amounts to the Platonic Ideal of the entity they're working on, and then more-or-less map each source's version of the entity into the physical table(s) created to reify the model.
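
In the pipeline, that forced conversion is usually nothing more exotic than a mapping expression. A hypothetical sketch, assuming the two terms schemes above are collapsed into a coarser common scheme, follows.

-- Sketch of a Realist mapping into one coarser, all-encompassing terms scheme.
-- Source systems, source values and target buckets are all hypothetical.
SELECT order_id,
       CASE
         WHEN source_system = 'SRC1' AND payment_terms IN ('0-15', '16-30') THEN 'UP_TO_30'
         WHEN source_system = 'SRC1' AND payment_terms IN ('31-45', '46+')  THEN 'OVER_30'
         WHEN source_system = 'SRC2' AND payment_terms IN ('0-10', '11-30') THEN 'UP_TO_30'
         WHEN source_system = 'SRC2' AND payment_terms IN ('31-60', '61+')  THEN 'OVER_30'
         ELSE 'UNMAPPED'   -- surface anything the governance process hasn't covered yet
       END AS ods_payment_terms
FROM   stg_order_header;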

And there's a lot of value to just such an approach. In the Nominalist version of the ODS_ORDER table all of the fields in the source systems are present, some of which may need to be semantically normalized in the reporting layer. That is, at some point someone has to find a way to map all the various ORDER_STATUS variations into something consistent for Finance. In the Realist version there's only one set of ORDER_STATUS fields, agreed upon by all using some kind of governance process - even if that process is just dictated by the modeler - or at least normalized in the pipelines. The organization may have multiple payment terms schemes, but the Realist ODS allows them to consolidate their reporting on one such scheme. And isn't that what everyone wants anyway, a single source of truth?

But there's a lot more work involved in the Realist approach: a lot more experience with different versions of orders, a lot more consensus or agreement-seeking from the various stakeholders, and a lot more research on business logic to ensure schemes are mapped correctly and their inconsistencies ironed out. And in fact those inconsistencies may not be removable. It may be that two sets of order statuses simply can't be reconciled, for whatever political or logical reason. In that case, the Realist can just take the easy way out and add in fields for those two separate schemes, perhaps prefixing the field names to indicate which source system they came from. Which is to say, the Realist can always fall back on the Nominalist approach. They may still need to do some kind of semantic normalization in a less rigid layer, likely the reporting layer. They will still need to document and revisit these attempted reconciliations over time.

In science, when a theory is proposed, tested against new data, and modified again and again to accommodate that data, it eventually becomes too cumbersome and unwieldy to manage. The same thing happens in a Realist ODS. That's not to say that a Nominalist ODS never has to be refactored, but if fields are named correctly in a Nominalist ODS the documentation is somewhat self-evident. In a Realist ODS the mapping of source to target needs to be described, or at least strongly version controlled, perhaps via pipeline transformation mappings. It's those mappings that define the ODS, after all. The worry isn't pressing when it comes to mapping various quantity fields into ODS_ORDER.QUANTITY, for example. But in the case of complex payment terms schemes or order statuses, those mappings constitute the semantics of the system. As such those semantics need to be kept clean and orderly.

The author doesn’t take a particular position on the merits of the Realist or Nominalist approach to ODS development. In both cases a reporting layer will still need to be built. Whether the data management practitioner assigns normalization to the upserting pipelines, in the Realist approach, or the reporting layer in the Nominalist approach, the work is equally complex in a large and diverse ecosystem. When deciding which approach to take it may be simpler to consider, from a teamwork standpoint, with which members of the team lies the most business knowledge, political sway and technical expertise - in short, whether the data engineering people are more clever than the reporting people - and assign accordingly. Generally the Nominalist approach is quick, simple, and requires the least amount of messy business involvement and politicking. The Data Management team can develop an ODS and start providing value using clear logical steps that take very little time to execute. Often however the need for a semantic layer is paramount, because downstream consumers of the data need to be insulated from variation or vagary in the upstream systems, and in that case there is a clear need for a Realist approach. In practice, however, there are some scenarios that point the way to clear picks. 

In a situation where an organization’s functions have individually standardized on a single platform, such as an organization with one ERP system, one CRM, one PLM, etc., or perhaps even multiple instances but one brand of each system, then the Nominalist approach makes a lot of sense. In such a case there is really only one set of denormalizations in each system: Denormalize order tables in the ERP system, and related tables in the CRM, and there really isn’t much more to the task. There may be mappings required between systems, but not between conceptions of order, for example, which is what’s required in an organization using two different brands of ERP. In such a happy scenario, which is surprisingly common, an ODS can generally be constructed very quickly using the Nominalist approach.

In a situation where there are multiple schemas for order, perhaps in an organization that’s grown by acquisition and not yet standardized on a single order-management system, then the Nominalist approach can still work if it’s possible to partition the source systems somehow in the target tables. For example, if the ODS is being piloted for Japanese users, then perhaps the ODS’s order subsystem can be limited to just the Japanese sales order-source.

“Partitioning” as a concept in the Nominalist approach deserves more comment. It may be obvious, but when denormalizing data from multiple subsystems into one giant table of e.g. orders, many of the fields in the giant table will be NULL because they will not have corresponding values in the sources. Suppose for example there are four sources of order data, aliased SRC1 through SRC4, with a million records each at the most granular level. When those sources are denormalized using a Nominalist approach the resulting table will have four million rows, or a million for each source combined into one table. Now if one order source contains a required field called SRC1.CUSTOMER_CREDIT_STATUS for storing data about the creditworthiness of a customer, then each of those million records in ODS_ORDER from that source will have a value for CUSTOMER_CREDIT_STATUS. But the other three million records, sourced from other systems, will have NULL values for ODS_ORDER.CUSTOMER_CREDIT_STATUS. While this fact alone doesn't argue strongly for explicitly storing metadata about the source and disposition of each record (there's no need to argue that point, because it's obvious), it's significantly easier to manage querying of this data with a rigorous metadata schema grafted onto the denormalized table. Implicit partitioning of the data will be plain: Any record with a value for CUSTOMER_CREDIT_STATUS comes from one source and not the others. But over time this partitioning will become less clear, and so the data management practitioner should ensure explicit partitioning is put in place early in the process.
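
What "explicit partitioning" amounts to in practice is usually just a source-system metadata column, populated at load time and used routinely in queries; the sketch below uses invented names.

-- Sketch: explicit partitioning via load-time metadata, rather than inferring
-- the source from which columns happen to be NULL. Names are hypothetical.
ALTER TABLE ods_order ADD COLUMN source_system VARCHAR(10);

-- Each source's pipeline stamps its own records as it loads them, e.g.:
-- INSERT INTO ods_order (source_system, ...) SELECT 'SRC1', ... FROM stg_src1_order;

-- Consumers can then partition explicitly instead of guessing from NULLs:
SELECT order_no, customer_credit_status
FROM   ods_order
WHERE  source_system = 'SRC1';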

One clear situation where explicit partitioning becomes valuable because implicit partitioning becomes less clear arises from a natural and to-be-hoped-for consequence of developing an ODS: Namely, that the people who use the ODS to represent data from their various partitioned sources begin to discover attributes available in other systems, not their own, that they’d like to have in their own system. Consider the four-million-row ODS_ORDER table: A million rows have a specific value for their CUSTOMER_CREDIT_STATUS field and three million don’t, but the people who originally managed those three million rows without an indication of customer creditworthiness decide they would, in fact, like something similar in their own systems. They are all working for one organization, after all, and best practices should be shared; if keeping creditworthiness ready-to-hand is a good idea, then everyone should use it. In that case it may be the system owners of those sources decide to add, in phases, a CUSTOMER_CREDIT_STATUS field to their systems, which will eventually flow into the ODS, which has a field, more or less, for that data. As likely, however, the Data Management team might be asked to derive something similar in reporting, using business logic ported over from the appropriate system. In both such cases, the sensible approach is to move the data originating in (or derived from) all four sources into a single CUSTOMER_CREDIT_STATUS field, tagged or “typed” appropriately so it’s possible to differentiate SRC1.CUSTOMER_CREDIT_STATUS from SRC2.CUSTOMER_CREDIT_STATUS. 

Eventually, in the fullness of time, the Nominalist approach will give way to a Realist data model. The above example shows how: Over time, as the sources included in the Nominalist schema become familiar and the organization standardizes either on a semantics or on a single system, it will be possible to map the values from one source into another, or to map both into a clearly-defined set of attributes that mean the same thing across all of the sources. It's possible then to argue that the Realist position doesn't ever need to be pursued formally because it will happen naturally and organically over time; or alternatively, that it's best to pursue the Realist option first, since that will intentionally speed the arrival of a consolidated semantics that will happen anyway. In practice the two approaches are not so much metaphysical positions about the nature of data management as they are heuristics that may be appropriate for one circumstance and not another.

The seven-step process given above for building an ODS was glossed pretty quickly. It's now time to proceed through its first six steps in a little more detail:


  1. Go find an order data model on the Internet. 


This step can be accomplished in at least three ways. If the situation calls for a Realist approach, there are multiple sources for these schemas: Adopt the model used by a leading vendor such as Oracle or Microsoft in their ERP offerings, for example, or use one of the pattern books mentioned in the section of this book entitled "Patterns of Persistence." Finally, if a Nominalist approach fits the situation, there's really no need to do more than use the order model already in the source system(s).


2. Denormalize it as liberally as possible by joining each leaf node to its parent node. 


Denormalization is simply the process of joining together tables that are kept in third or higher normal form. This can be accomplished by finding the top or "trunk" table in the relevant schema, and performing left-outer joins on each subsequent leaf node. In the case of orders, for example, there's usually a HEADER table which contains data about the primary customer, their billing and shipping addresses, summaries of the quantity of items on the order and the number of lines, the expected ship dates and the order date, and similar data which is shared across all of the items on the order. The leaf node for the HEADER table will be a DETAIL or ITEM table, which contains data relevant to the specific products ordered by the agents stored on the HEADER table, including the quantity of the product requested, perhaps the service terms, the prices paid, and so on. There may also be a leaf node for the DETAIL table, almost always called ITEM, which may break the quantity of product ordered down into further subsets for whatever purpose.

In the case of customer data, to take another example, typically the data will be broken into separate subject areas. Core CUSTOMER data will consist of the usual name fields, date of birth if needed, various national identifiers, and possibly demographic data such as the agent’s marital status and gender. If the agent is an organization there may be fields for a human contact. Additional data might include address and other kinds of contact data such as email address, all of which may be stored in separate tables. There may be payment information such as credit-card numbers or invoice templates. 

In the case of CUSTOMER data, which is more complex than the order, the data management practitioner needs to make two primary decisions. First, what data is needed in the ODS for reporting: Is address data necessary while payment terms can be ignored? It's likely that some calibration with the business will be necessary here; first-line users may not clamor for payment terms, but their internal customers may need them eventually, and so it helps to include more than is strictly necessary even if no one immediately has a need for it. (Yes, this violates the classic YAGNI principle; see the section entitled "Can it ever be Agile?")

Second, the designer needs to decide how to handle data that has changed in the past, and whether that needs to be included in the record. For example, while a CUSTOMER may have a current bill-to and ship-to address, they may also have a recent or prior address, and this data may be stored in the address table as well. Does the ODS_CUSTOMER record need to include the bill-to address active prior to the current, active one? Probably not, but (a) it’s good to check with potential users and (b) the decision to get only the current one will still require some potentially fancy SQL. The determination that the ODS does need this historical data will dictate the structure of the ODS_CUSTOMER tables, however. If the current bill-to and ship-to are all that’s required, then the ODS_CUSTOMER record can include this data, denormalized from its sources into their own fields on the record, and as those values change the CUSTOMER record can be updated to reflect the latest values. If however the historical addresses also need to be transferred to the ODS there will of necessity be a many-to-one relationship between ADDRESS and CUSTOMER, and there will need to be a table structure to reflect that logic. 
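
The "potentially fancy SQL" for picking out only the current address is typically a window function or a correlated subquery; a sketch with invented names, assuming a platform that supports window functions, is below.

-- Sketch: select only the current (latest effective) bill-to address per customer.
-- Table and column names are hypothetical; assumes window-function support.
SELECT customer_id, address_line1, city, postal_code
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (PARTITION BY a.customer_id
                              ORDER BY a.effective_date DESC) AS rn
    FROM   customer_address a
    WHERE  a.address_type = 'BILL_TO'
) ranked
WHERE rn = 1;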

Each leaf node table will generally have its own primary key, and a foreign key to the table one level above. The DETAIL table will thus have a HEADER_ID pointing to the primary key of the HEADER table, as well as its own DETAIL_ID. The ITEM table will have an ITEM_ID, as well as foreign keys to DETAIL_ID and possibly HEADER_ID as well. The ODS’s denormalized order table should include these keys as well, to assist with auditing, for each level of denormalization, but obviously only one copy of each needs to be included. 

These keys can be used as well to find additional leaf nodes that may not be apparent from the data dictionary or the source system’s model. Most modern databases maintain their internal system catalogs and machinery inside the database itself, often in a set of “master” tables and a “master” database that manages the configuration of each asset on the database server. Searching these master tables for tables with the fields HEADER_ID or ITEM_ID can surface potential foreign key relationships between the trunk table and other tables in the database. This can be particularly useful for subject areas like CUSTOMER, where there may be lots of formal tables, and lots of custom tables, and troubling diversity in naming conventions. Using the master tables to search for all the tables that contain the field CUSTOMER_ID will help the data management practitioner find tables that have a relationship with the core CUSTOMER record, and potentially fields that should be included in the ODS CUSTOMER table or at the very least in the ODS CUSTOMER subject area.
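
On platforms that expose an ANSI-style information schema, that catalog search can be as simple as the query below; the exact catalog views vary by brand, so treat this as a sketch.

-- Sketch: find every table carrying a CUSTOMER_ID column, as candidate leaf
-- nodes for the CUSTOMER subject area. Assumes an ANSI-style INFORMATION_SCHEMA;
-- proprietary system catalogs need their own variants of this query.
SELECT table_schema, table_name, column_name
FROM   information_schema.columns
WHERE  UPPER(column_name) = 'CUSTOMER_ID'
ORDER  BY table_schema, table_name;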


3. Add in any oddball fields you may have in your own order management systems.


Every system is customized, more or less. Customization takes three forms in the database schema: 


  1. Adding custom tables to manage data or relationships not envisioned in the standard schema. 

  2. Adding custom fields to standard tables because the fields already there aren’t quite right.

  3. Changing or adding the values of standard fields to reflect a changed semantics in the application layer. 


In each case, if the ODS’s order model is derived from e.g. a vendor’s website, customized fields or tables won’t be in the model. Similarly if the proposed ODS model is based on a published pattern from one of the data-model pattern books, or developed as a Realist idealization of orders, there will be fields, tables or relationships that have been added to the source system in a completely ad-hoc spirit. These may violate the beauty and completeness of the model the ODS designer might implement.

Those ad-hoc additions can occasion a lot of angst, and a lot of earnest (and time-intensive) soul searching or brain-wracking in an attempt to find a fit for the ad-hoc within the ideal conceptual schema proposed for the ODS. There are two ways to manage oddball fields. 

First, the data management practitioner can initiate a sustained effort to determine the point of these fields and whether they can be assimilated into an existing set of fields or ignored entirely. Inevitably, or at least almost always, these fields, tables or relationships are in the database precisely because there's some quirk to the business model or its processes that requires data be recorded in a certain way. In fact their very presence as customizations usually means the argument the data management practitioner is considering starting has already happened, and the ideal or original model has lost.

So that suggests the second method for dealing with oddball fields: First, accept them; second, denormalize them along with the other fields in the surrounding table, keeping an eye out for ways to collapse their semantics into the more perfect model; and third, and most of all, accept them.


4. Calculate the number of active orders in those systems and provision database space accordingly.


This process highlights the practical fact that, in metaphysical terms, the data management process simply rearranges records as they move from creation to destruction, from source to various targets. Absent an aggregation step, the record created, stored and updated in a source system could be identified through its separated pieces in an ODS, a Kimball warehouse, a data vault or a data lake. The rearrangement process occurs for a variety of reasons - simplification, optimization for querying, consolidation with other sources - but the amount of space taken up by the original record is subject to a linear transformation, if it changes at all. That is, if a record in a source system uses 10kb of space in the source system, and all of that record is transferred into an ODS, then the record will also use 10kb of space in the target system. This is simple composition. Of course the actual total space used may be somewhat less, especially if a denormalization removes multiple physical tables or if certain fields are left behind. Or the actual total space may be slightly more, because the platform the record’s components are moved into may not be optimized for those components. But the original associated space requirement is a good approximation of the amount of space required to store the data after rearrangement.

That means, in practice, that if a customer subject area uses 60GB, orders 30GB, products 50GB and customer service contacts another 100GB, the total amount of space required in an ODS that takes a Nominalist approach to its construction will be ~240GB. The actual amount of space required will almost certainly be less, but to be on the safe side the data management practitioner should pad their estimate by ~25%, bringing the total estimated space required to about 300GB.

This padding is done for a number of reasons. For one, it's always a good idea to ask for more space than you need, at the very least in case something dramatic happens in the difference between this author's experience of the world and the reader's. As well, it may be difficult to explain to cost-conscious managers that extra space is necessary for unexpected fluctuations in order volume or customer behavior. It's also often difficult to make the detailed technical argument that many platforms simply require a certain amount of headroom over and above the data stored at rest, and that their growth is not a linear function. Building the allowance into the estimate up front is simpler than making those arguments later.

That estimate is for production data, as well; this is the data that will be queried by users. It's a good idea to provision an equal amount of space for staging, the environment, layer or schema where data for the ODS will be prepared for loading. This process will be dealt with in more detail in the section on data engineering, but there are a couple of heuristics to keep in mind.

First, it's very likely that components of the ODS will need to be completely reloaded multiple times, whether due to changes in business logic, new fields added to the source system, or various errors on the part of the data management practitioner or their colleagues on the source-system teams. In such a scenario a lack of staging space can require far more creativity with space management than the data management practitioner may want to exercise. Unaffected data will continue to move through staging on its way to production, while gigantic blocks of incremental data are shuffled from one corner of staging to another, waiting for their opportunity. Those kinds of logistics can lead to mistakes in loading, where entire blocks get forgotten or mistakenly processed and left out of the production tables, or worse, impact the reporting. It is better, in general, to use the simplest change markers possible to determine what data needs to be extracted from the source systems, because complex windowing functions in pipelines always require manual intervention and the errors that entails. In a situation where the entire order table in the ODS needs to be reloaded, then, it's best to do it all at once, instead of one year or one month at a time.

Now the source system owners may not allow such profligate access, in which case a window needs to be negotiated and executed. In general this does not add up to a better solution, and the source system owners may be approaching the problem irrationally; most database platforms have an easier time extracting an entire table, without a WHERE clause, than they do extracting arbitrary partitions of it. Extracting a year every night may lead to more failed extracts, and more disruption to the source system, than just extracting an entire decade at once. The data management practitioner will need to weigh what they know about the source-system platform against both the political cost of leaning on source system owners and the cost of delaying access to production data in the ODS.

Second, there will be a certain amount of data moved to stage for audit purposes that will never make it to production, but the cardinality of that data will be the same as a subset (and possibly all) of the production data. The records will not be as wide as production data, because audit should rely on counts of keys, natural or otherwise. But there will still be a lot of them. While it may be the case that a periodic process could archive or otherwise shift large blocks of data around on the weekend to permit the load of audit data, it is once again simpler and less error-prone to just permanently maintain the space necessary to load audit data, rather than scrounge for it.


5. Build an upsert operation against that new model from each source using whatever best practices are associated with your data engineering tool.


An "upsert," simply put, is an operation that determines whether staged records are new and should be inserted into a production table, or are old and should simply update records already existing in the production table. There's a lot of hand-waving in that statement, so it's due for some unpacking. More time will be spent on upserts in the data engineering sections, but as with the prior step there are some considerations.

Upserts may be performed by off-the-shelf data engineering tools, using e.g. widgets that are dragged onto an IDE and configured, or at their most basic they may be performed using a simple combination of steps, often in SQL. The focus of an upsert is always on two specific data sets: 

  1. The set that is production data used by end-users or other downstream processes, which should have its own natural key(s).

  2. The set that contains data extracted from various sources and consolidated into a staging set, that is intended to be production data, and may contain:

    1. New instances of the natural key, e.g. new records with new values of HEADER_ID + ‘-’ + DETAIL_ID.

    2. Old instances of the natural key where the record has been marked as changed, e.g. records with old values of HEADER_ID + ‘-’ + DETAIL_ID but where a status field in the record has changed, perhaps from NEW to BOOKED.  


The goal of the upsert is to mark in staging those records intended to be inserted, and those intended to act as the source of an update. Records that are intended to be inserted can be identified easily: The value of their natural key (e.g. HEADER_ID + '-' + DETAIL_ID) doesn't already exist in the production table. In most variants of SQL this can be determined using the WHERE NOT EXISTS construction, and most off-the-shelf tools with configurable widgets will, at base, be using exactly this SQL. Records that are intended to act as the source of an update will be the complement of the "insert" set; that is, any staged record that has a natural key with values that are already in the production table is a source for the update. Once these records have been identified in some way, the update and insert operations can proceed. Generally that's the order the upsert runs in, as well: First do the update, and then do the insert, so the update runs against a table with as few records as possible.
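
At its most basic, and leaving aside platform-specific MERGE statements, the two-step upsert described here looks roughly like the following sketch; the table and column names are the hypothetical ones used earlier, and UPDATE ... FROM syntax varies by platform.

-- Sketch of a basic two-step upsert: update first, then insert.
-- Step 1: update production rows whose natural key already exists in stage.
UPDATE ods_order o
SET    picking_status       = s.picking_status,
       shipment_prep_status = s.shipment_prep_status,
       ods_last_update_date = CURRENT_TIMESTAMP
FROM   stg_ods_order s
WHERE  o.source_header_id = s.source_header_id
AND    o.source_line_id   = s.source_line_id;

-- Step 2: insert staged rows whose natural key is not yet in production.
INSERT INTO ods_order (source_header_id, source_line_id, order_no,
                       picking_status, shipment_prep_status, ods_last_update_date)
SELECT s.source_header_id, s.source_line_id, s.order_no,
       s.picking_status, s.shipment_prep_status, CURRENT_TIMESTAMP
FROM   stg_ods_order s
WHERE  NOT EXISTS (SELECT 1
                   FROM   ods_order o
                   WHERE  o.source_header_id = s.source_header_id
                   AND    o.source_line_id   = s.source_line_id);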

It is possible to get really persnickety about the update process, and there are circumstances where attention to detail is rewarded. For example, a data management practitioner may want to mark which fields have changed values between stage and production, only update those particular fields, and record which fields were updated in that particular operation. This can be accomplished using various hashing functions, and it has value in cases such as operational medical records where strict legal rules apply to change tracking for safety audits. Similarly the team could make the decision to delay updates to a non-critical time period, while not delaying inserts, which might be a good idea both because updates may lock users out of the table until they complete, and because updated data isn’t crucial for users, but new data is. A situation like that might arise when users continually access a queue of new data, for example, but “old” data or updates to old records are only necessary for periodic reporting purposes.
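
One common version of that persnickety change detection compares a hash of the tracked columns in stage against the hash stored on the production row, so only genuinely changed rows are touched; the sketch below assumes an MD5 function, IS DISTINCT FROM support, and invented column names.

-- Sketch: update only rows whose tracked fields have actually changed,
-- detected by comparing column hashes. Functions and syntax vary by platform.
UPDATE ods_order o
SET    picking_status          = s.picking_status,
       delivery_confirm_status = s.delivery_confirm_status,
       row_hash                = s.row_hash,
       ods_last_update_date    = CURRENT_TIMESTAMP
FROM  (SELECT source_header_id, source_line_id,
              picking_status, delivery_confirm_status,
              MD5(COALESCE(picking_status, '') || '|' ||
                  COALESCE(delivery_confirm_status, '')) AS row_hash
       FROM   stg_ods_order) s
WHERE  o.source_header_id = s.source_header_id
AND    o.source_line_id   = s.source_line_id
AND    o.row_hash IS DISTINCT FROM s.row_hash;   -- skip rows with no tracked change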

In general, however, there is no harm in the context of the ODS to just updating the entire record, at least as a first approximation for a strategy. Skip the careful determination of which field has changed. As long as the target table can include both the timestamp for the update to the local record, and as much as possible the timestamps for the update date(s) recorded in the source system(s) for each component of the record, there is no real need to be specific about which field has been changed. 


6. Index the table for deduplication and a couple of common search strings. 


There are generally two use cases ODS tables need to support. First, they need to allow the speedy incorporation of new data. Second, they need to support the search for particular records, which is the use case that pays the bills. When indexing production tables, then, the data management practitioner needs to balance the utility of indexes designed for either case.

In practical terms the deduplication use-case can be met by indexing the fields that make up the natural keys shared by the target production table and the stage table. This will permit a quick determination of which records in the stage table (or environment) need to be marked for insert, and which for update. 

In the ideal situation the search use case will make use of the same natural key fields as the dedupe case, but the fields that identify a distinct record are likely not the fields searched by users looking for particular records. Consider the case of the order: Users are likely looking for specific orders by ORDER_NO, orders placed by specific customers using CUSTOMER_NO or CUSTOMER_NAME, or orders for specific products using e.g. SKU. Indexing on the fields related to those searches looks promising, but those customer-facing fields usually can't do double duty as the record's physical identifiers. The smart data management practitioner will include ORDER_NO, SKU and CUSTOMER_NO in their order table, if those fields are available, but those are not likely the fields used as natural keys by the source tables. (Those _NO fields are proxies for fields on old paper-based order management systems that have survived into the modern era. They are handy representations of the data, particularly for customers, but they aren't sufficient for the data management practitioner.) Source keys will include fields like HEADER_ID, DETAIL_ID, PRODUCT_ID and CUSTOMER_ID, which may undergird the credibility of ORDER_NO, SKU and CUSTOMER_NO as representations of natural keys, but which are the actual fields intended as keys in the source system. The _ID fields are what deduplication indexes will be built out of, while the _NO fields are what search indexes are made from.

Generally then the ODS designer will begin their indexing strategy with one index covering the entire natural key, configured so fields are accessed in the order in which they’re named in the deduplication or upsert query. An identical index should be placed on the staging table(s) that source the production table, and the dedupe or upsert query that connects the two tables should ensure they make maximum use of both indexes. 

The second set of indexes should be placed on whatever fields or combinations of fields are likely to be used by users. These search indexes should be reviewed periodically for their utility; most modern database platforms will provide you with statistics on the usage of indexes as compared to raw table accesses, to help determine whether people are getting data without the use of the indexes, and some platforms will even make index recommendations based on table-access patterns. There are two sources of input that can be used to figure out which fields are best for search fields: First, any existing reports run for end-user groups will likely have some kind of search criteria; and second, given the ODS is meant to mimic and consolidate the structure(s) of the source system(s), the indexing of the transaction source(s) can help determine which indexes are needed in the ODS. Consider that transaction database indexes are often designed to provide their stakeholders with fast access to order status, for example, and it’s easy to see how a similar index on the corresponding ODS table(s) would allow data management practitioners to offload access from a busy collection system to a busy retrieval system, as was discussed earlier in the Patterns of Persistence section.
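
In DDL terms the whole strategy amounts to something like the following, with hypothetical names: one composite index covering the natural key on both production and staging, plus search indexes on the customer-facing identifiers, reviewed against actual usage.

-- Sketch: composite natural-key indexes for the dedupe/upsert path,
-- plus search indexes on customer-facing identifiers. Names are hypothetical.
CREATE UNIQUE INDEX ix_ods_order_nk ON ods_order     (source_header_id, source_line_id);
CREATE INDEX        ix_stg_order_nk ON stg_ods_order (source_header_id, source_line_id);

-- Search indexes, reviewed periodically against the platform's index-usage statistics.
CREATE INDEX ix_ods_order_order_no    ON ods_order (order_no);
CREATE INDEX ix_ods_order_customer_no ON ods_order (customer_no);
CREATE INDEX ix_ods_order_sku         ON ods_order (sku);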

We started this discussion by explaining the utility of the ODS, especially with respect to its more common companion, the Kimball warehouse. We proceeded through a discussion of the basic process of putting an ODS together, considering a couple of different methods one might use to model the source data. And we ended with some more detailed prescriptions. The ODS remains largely overlooked as a foundational component of a good data management program. It has a negative image as a result of both the bad PR of its proponents and the apparent silver-bullet simplicity of its competitor, which promises quick, painless, easy solutions to data management challenges that almost inevitably fail when applied to use cases they're not intended for. The ODS is a simple and easy system to build, with enormous value. And while for this author there is great beauty in a perfectly-tailored star schema, with its effortlessly derived snowflakes, the ODS is the business, and building one feels like a lasting and significant contribution to understanding what an organization is trying to do.
