4.6 Master Data Management
It’s a basic fact of the modern data ecosystem that critical data about critical entities will be duplicated and, more importantly, inconsistent, simply because that data is created in different source systems. This happens in the simple case where a customer interacts with multiple applications, and each application creates its own CUSTOMER record. In the more complex case, two applications are downstream of a third, and copies of source records are sent to the downstream systems where they inevitably get updated or supplemented. Master data management, or MDM, is the process of reconciling those duplicate records into canonical entities that correspond to the real-world things they describe. MDM is in one sense a brute-force solution to the governance problems caused by incompatible data models and inconsistent form validation. In another sense, MDM is the pragmatic connection between a data management ecosystem and the real world. And in a third sense, it’s one of those critical layers we see in good data management practice.
In this section we’ll discuss the process of mastering, which is basic to all pipeline development, and how that process eventually gets turned into the components of an MDM system. We’ll also walk through some of the use-cases, including the specific kinds of entities commonly managed in an MDM system. Finally, we’ll explain how MDMs should be integrated into the rest of their ecosystem.
MDM has a reputation as a complex, technically fraught and commercially lucrative operation. Many architects treat the process as a kind of voodoo that requires a large, sophisticated consulting team and a lot of expensive software to attempt. In practice, however, mastering happens whenever we deduplicate the entities we find in source records. We do a kind of master-data management whenever we build CTEs or temporary tables integrating multiple sources of a domain or entity, as intermediate steps in our pipelines. Different considerations apply to each kind of record we aim to master, and there’s a little more to the practice of MDM than thoughtful deduplication, but thoughtful deduplication is where it starts, so in this section we’ll walk through its general principles. If you’ve built a report that deduplicated CUSTOMER, PRODUCT or ORDER records sourced from more than one system, then you’ve performed the basic operations of MDM. In practice, a full-fledged MDM system is this process plus a few more simple-to-perform functions.
Any of the major entities shared across more than one system can be subject to the need for mastering. Most common are CUSTOMER records, with PRODUCT and ORDER almost as common. Other domains include STORE, ACCOUNT, or even METRIC, but really any real-world entity whose data is stored in more than one system will eventually require some form of MDM. Fire trucks and fires, trees in an urban canopy, legal decisions, SNPs in DNA: The list of possibilities is endless. In each case mastering may need to proceed on the individual entity, on the hierarchy an individual entity belongs to, or on both. Let’s take these two aspects in turn, focusing on CUSTOMER, PRODUCT and ORDER.
First, each distinct record of a CUSTOMER, PRODUCT or ORDER will of necessity get renamed or rekeyed as it’s copied or recreated across systems. In the simple case alluded to above, for example, someone goes to our ERP and enters their customer data when they create an order for a given product. Then when they call customer service for help their CUSTOMER data is re-entered into the CRM system, which also contains a local representation of their ORDER and the PRODUCT they ordered. Each system expects different data: The ERP may expect legal names and the CRM familiar names, or BILLING_ADDRESS versus SHIPPING_ADDRESS, and thus records about the same person will necessarily contain different data. The CRM version of the PRODUCT may contain data about delivery times or user manuals, which may or may not be synched with the data in the ERP. Similarly the slice of the ORDER contained in the CRM will be different from what’s in the ERP. The CUSTOMER, PRODUCT and ORDER records in the ERP system are assigned keys, the CRM records are assigned their own keys, and the net result is that the person behind the record has been assigned two different identifiers within two different databases.
(As a side note, there is a famous problem in Analytical Philosophy of Language concerning the planet Venus and the names “the morning star” and “the evening star,” famously discussed by Gottlob Frege. The planet Venus is the referent for both names, which are the senses or meanings assigned to that referent. Language is filled with these problems: We mean different things about the same referent, and much of what happens in scientific language is the disentangling of sense and reference, to identify whether two things we think mean different things are really about the same thing. You might think of the ERP representation of a CUSTOMER as “the evening star,” the CRM representation as “the morning star,” and the real customer as Venus. Master-data management is then the process of identifying which senses belong to which referents.)
For the most part this duplication of customer data across systems doesn’t become a concern until it needs to be. When we count how many distinct customers there are in each system, or we need to search the ERP system for orders using data pulled from the CRM system, we discover that we have two representations of the same person or the same ORDER. In the reporting case we clearly know we have duplicates and we want to avoid double-counting customers, but we haven’t yet identified which records are duplicates. In the search case, perhaps the customer has provided their familiar name to the CRM, but there is no corresponding field on the ORDER in the ERP system, where the legal name is used. Or maybe we have a waybill number for the delivery in the CRM that hasn’t been stored on the original order. We want a good experience for both kinds of user: the person who’s ordered and asked for help, and the downstream report user who wants to know we aren’t double-counting when we count CUSTOMER. Our different representations, however, make that positive experience difficult to deliver.
You may recall from the Layers discussion in section 3.2 that we recommend the organization of entities after we do a basic deduplication of records from a source, and before we create a business process representation. In practice, we first collect ORDER or SERVICE_CALL records from our sources in a RAW layer, say by copying our ERP and CRM data into a Data Lake. Then we reorganize those records into their appropriate entities in an ENTITY layer. This may happen explicitly, when we create tables, or it may happen implicitly when we use CTEs or temporary tables as containers to perform an entity-organization function within our semantic pipelines. We perform this entity reorganization because we intend our pipeline to support, for example, a count of distinct customers. We know that there’s a basic CUSTOMER record we can derive from our ORDER data from the ERP system, and we know that there’s similarly a basic CUSTOMER record derivable from our SERVICE_CALL record, and we will need that CUSTOMER record to get our count. We also know for a fact there’s an overlap, simply because each SERVICE_CALL record is necessarily all and only about an ORDER record. The customer list is the same, in the real world.
Best and typical practice in our semantic pipeline to get to our deduplicated CUSTOMER count would - very explicitly - run as follows (a minimal SQL sketch follows the list):
1. Create a temporary table or CTE for ORDER_CUSTOMER containing the customer-related fields in RAW.ERP_ORDER.
2. Create a temporary table or CTE for CRM_CUSTOMER containing the customer-related fields in RAW.CRM_SERVICE_CALL.
3. Deduplicate and insert the ORDER_CUSTOMER and CRM_CUSTOMER records into a single CUSTOMER table, by matching against some constructed natural key.
4. Accommodate any special-case logic.
5. Count the results in the final table.
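To make those steps concrete, here is a minimal SQL sketch of the pipeline. The field names (SSN, DOB, FIRST_NAME, LAST_NAME) and the deliberately simple constructed natural key are assumptions for illustration; the real fields will depend on your sources.

```sql
-- Steps (1) and (2): pull the customer-related fields out of each raw source.
WITH ORDER_CUSTOMER AS (
    SELECT SSN, DOB, FIRST_NAME, LAST_NAME
    FROM RAW.ERP_ORDER
),
CRM_CUSTOMER AS (
    SELECT SSN, DOB, FIRST_NAME, LAST_NAME
    FROM RAW.CRM_SERVICE_CALL
),
-- Step (3): union the two sources and deduplicate on a constructed natural key.
CUSTOMER AS (
    SELECT DISTINCT
           SSN || '|' || CAST(DOB AS VARCHAR) || '|'
               || UPPER(TRIM(FIRST_NAME)) || '|' || UPPER(TRIM(LAST_NAME)) AS NATURAL_KEY
    FROM (
        SELECT * FROM ORDER_CUSTOMER
        UNION ALL
        SELECT * FROM CRM_CUSTOMER
    ) ALL_SOURCES
)
-- Step (4) would add special-case logic between the CTEs above and this
-- final query; step (5) counts the deduplicated result.
SELECT COUNT(*) AS DISTINCT_CUSTOMERS
FROM CUSTOMER;
```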
This semantic pipeline, as mentioned, might make use of a temporary table in an “operational” layer, used to materialize the *_CUSTOMER source records to make them easier to manipulate. Or it may use a SQL common-table expression or CTE. Either way the goal of steps (1)-(4) is to solidify a single set of CUSTOMER records that has been deduped across the two sources. We should note, as well, that steps (1)-(4) are identical - i.e. logically equivalent - to what we do when we create an ENTITY layer in a database for use by analysts.
A pipeline constructed using this practice makes it easier to add in additional sources of potential CUSTOMER records, e.g. in additional steps between (2) and (3), perhaps from additional ERP or CRM systems or some other source within the ecosystem. But it also gets to the heart of the mastering process, and steps (3) and (4) are really what MDM is about.
We expect that ORDER_CUSTOMER and CRM_CUSTOMER will have different artificial keys. ORDER_CUSTOMER will have its own locally-assigned primary key, as will CRM_CUSTOMER. Our pipeline will need to create a natural key out of the relevant ORDER_CUSTOMER and CRM_CUSTOMER fields to permit deduplication. Recall from our Data Lake discussion that we explicitly intended that system to identify those natural-key fields.
That is, we’ve used our Data Lake to determine how to master CUSTOMER records on the way to our report counting customers. This is step (3) above: We use the natural keys we’ve named in our two intermediate tables to create a third table with a deduped set of records.
An important point to highlight here is that our creation of a natural key is a semantic decision: We’re deciding what counts as a duplicate CUSTOMER record. We often find this kind of semantic decision to be a sticking point specifically when discussing the deduplication criterion in the construction of a RAW layer. In the case of raw source data, deciding what counts as a duplicate ERP_ORDER or CRM_SERVICE_CALL record requires constructing a natural key and deeming it sufficient to determine whether the meaning of a record is the same as any other record. In the case of the ENTITY layer, a similar but far more obviously semantic decision is made when we construct a permanent CUSTOMER table, sourced from ERP_ORDER and CRM_SERVICE_CALL. In the transition from our source systems to our RAW layer, we decide what makes a duplicate ERP_ORDER record; in the transition from our RAW layer to our ENTITY layer, we decide what makes a duplicate CUSTOMER record. Both are semantic decisions.
So much for straightforward deduplication using obvious natural keys, or step (3). There are of course complex cases, or step (4). If MDM is simply steps (1)-(4), it’s step (4) that introduces lucrative complexity to the mastering process. It will often be the case that simple syntactic logic - if the fields in this record match the fields in that record, the records are duplicates - is not a sufficient substitute for a semantic decision. Mastering is an excellent example of the Pareto principle at work, where the easy 80% of cases takes 20% of your time and the hard 20% takes 80% of the effort. It’s those cases where straightforward syntactic deduplication of e.g. CUSTOMER records is insufficient to catch all the real customers that make MDM appear to be a black art.
If MDM were as simple as syntactic matching on obvious natural keys then there probably wouldn’t be a need for actual MDM systems, and the standard and boring transition from RAW layer to ENTITY layer would be sufficient. But deduplication is often a long tail, composed of three main segments:
Automated straightforward syntactic deduplication, or step (3)
Automated corner-case syntactic deduplication, or step (4)
Manual semantic deduplication
Let’s consider each in turn.
We discussed the case of automated straightforward syntactic deduplication already: We concatenate SSN, DOB, FIRST_NAME and LAST_NAME, get a match or not, and thus deduce that two *_CUSTOMER records are or are not the same. But we get complaints, let’s say, from cases not caught by that test. The most common example: A single individual customer has one value for FIRST_NAME in the CRM and another value for FIRST_NAME in the ERP system, perhaps as a result of having different familiar and legal names, or of straight-up typos in the spelling of FIRST_NAME. As a result we have two records for that person in ENTITY.CUSTOMER when we should have only one. A significant number of genuine duplicates thus slip past the SSN + DOB + FIRST_NAME + LAST_NAME test, so we need an additional deduplication test. Perhaps we add in fields from ADDRESS, or include MIDDLE_NAME or variations on FAMILIAR_NAME or LEGAL_NAME, and mix and match the fields in this second, more stringent test to pop out the duplicates, expanding our natural key to stretch the meaning of “straightforward” as far as feasible.
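As an illustration, a second-pass test along those lines might look like the sketch below. The intermediate tables and field names (including POSTAL_CODE) are assumptions carried over from the earlier example, not a prescribed rule.

```sql
-- Second-pass test: the first names disagree (typo, or familiar vs. legal
-- name), but everything else in a widened natural key matches, so flag the
-- pair as a probable duplicate for the special-case logic in step (4).
SELECT
    erp.SSN,
    erp.FIRST_NAME AS ERP_FIRST_NAME,
    crm.FIRST_NAME AS CRM_FIRST_NAME
FROM ORDER_CUSTOMER erp
JOIN CRM_CUSTOMER   crm
  ON  erp.SSN = crm.SSN
  AND erp.DOB = crm.DOB
  AND UPPER(TRIM(erp.LAST_NAME))   = UPPER(TRIM(crm.LAST_NAME))
  AND UPPER(TRIM(erp.POSTAL_CODE)) = UPPER(TRIM(crm.POSTAL_CODE))
WHERE UPPER(TRIM(erp.FIRST_NAME)) <> UPPER(TRIM(crm.FIRST_NAME));
```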
This second kind of case eventually falls into the Automated corner-case syntactic deduplication segment of our long tail. Since we find these cases more obviously in ADDRESS data, we’ll switch our examples from CUSTOMER to ADDRESS for ease of exposition. The requirements here will vary considerably from one ecosystem to another. Suppose what is most common: We have tight validation but a surplus of fields for users to enter their data in one system, and a small number of fields but lax validation in another system. For example, our ERP includes an ADDRESS_3 field and the CRM doesn’t. Or our ERP address records expect the apartment number (or a similar subdivision, like an office number) to be included in the ADDRESS_1 field, while our CRM has an entirely separate field for APARTMENT_NO, and other sources validate the subdivision into the ADDRESS_2 field. Consider how we’d manage those: We need to create a deduplication test that accommodates both address fields that embed the subdivision keyword and address fields that explicitly call out APARTMENT_NO, and find a way to reconcile those fields, as in the following diagram:
In the case above we have three ways of representing the subdivision “Office,” including variations on the spelling of the keyword itself. If we were trying to find a way to accurately dedupe these ADDRESS records, which clearly all point to the same physical location, we’d need some relatively elaborate logic to identify variations of spellings of “Office” and the locations of the value we want to dedupe.
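To give the flavor of that logic, here is a hypothetical version of the three records and one way of collapsing them into a single natural key. The values, spellings and regular expressions are invented for illustration, and functions like REGEXP_SUBSTR vary by SQL dialect.

```sql
-- Three hypothetical representations of one physical location:
--   ERP:    ADDRESS_1 = '123 Main St Ofc 4'   ADDRESS_2 = NULL        APARTMENT_NO = NULL
--   CRM:    ADDRESS_1 = '123 Main St'         ADDRESS_2 = NULL        APARTMENT_NO = 'Office 4'
--   Other:  ADDRESS_1 = '123 Main St'         ADDRESS_2 = 'OFFICE 4'  APARTMENT_NO = NULL
WITH normalized AS (
    SELECT
        SOURCE_SYSTEM,
        ADDRESS_ID,
        -- street portion: upper-case, and strip any trailing subdivision
        TRIM(REGEXP_REPLACE(UPPER(ADDRESS_1), '(OFC|OFFICE)[ .]*[0-9]+$', '')) AS STREET,
        -- subdivision: prefer the dedicated field, then ADDRESS_2, then the
        -- tail of ADDRESS_1; canonicalize the keyword spelling
        REPLACE(TRIM(UPPER(COALESCE(
            APARTMENT_NO,
            ADDRESS_2,
            REGEXP_SUBSTR(UPPER(ADDRESS_1), '(OFC|OFFICE)[ .]*[0-9]+$')
        ))), 'OFC ', 'OFFICE ') AS SUBDIVISION
    FROM STAGE.ALL_ADDRESSES
)
-- Records sharing a reconstructed key are candidate duplicates.
SELECT STREET || '|' || SUBDIVISION AS ADDRESS_NATURAL_KEY,
       COUNT(*)                     AS RECORD_COUNT
FROM normalized
GROUP BY STREET || '|' || SUBDIVISION
HAVING COUNT(*) > 1;
```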
The tail segment of automated corner-case syntactic deduplications can be very long, and drive the diligent ENTITY-layer developer to distraction. Obviously some of what’s required is simple defensive SQL programming. But identifying all the possible variations a natural key must accommodate to enable additional precision in deduplication is usually beyond the needs of basic ENTITY-layer development. What’s wanted in most pipeline development practice is, e.g., clean customer or address records with a certain level of confidence about deduplication, but not the ultra-clean records we expect when maintaining high levels of customer experience. In other words, a count of distinct CUSTOMER records may not need to cover all of the possible cases that make a count unique. Data management practitioners organizing data for copy into an Analytics warehouse or ODS will often stop at a certain lower level of complexity, hoping the corner cases in the rest of the tail are not sufficiently numerous to really screw up the customer experience. It’s in these long-tail cases where the dedicated MDM system comes in, as an occasion and location for cataloging and implementing the corner cases for syntactic deduplication. Where automated straightforward syntactic deduplication will commonly be used for the kind of straight-through pipelines that support most naive analytical reporting, a certain level of automated corner-case syntactic deduplication will be necessary for the kind of ENTITY-layer creation that characterizes high-quality dimension tables in an Analytics warehouse or tables in the ODS.
The third kind of deduplication in our long tail is Manual semantic deduplication. A formal MDM system is marginally distinguished from a mere ENTITY-supporting pipeline by how many of the corner cases are automated. What really distinguishes the formal MDM system from the more expedient ENTITY-layer pipeline is the third kind of deduplication, which expects manual input. It may be that many of the corner-cases split hairs so fine they can’t be relied on to produce reliable results without their own elaborate logic. For example one string-cleaning operation for a specific deduplication formula may make another string-cleaning operation useless, and so stacking deduplication formulas suddenly requires a complex and unreliable order-of-operations to succeed. In such cases we’re really pushing the limit of automated syntactic decisions. We’re making the syntactic proxy for a semantic decision more complex than the scope of good pipeline development practice can support.
In many such syntactic corner cases it’s often just simpler to ask a human to make the decision. This is where manual semantic deduplication becomes an important component to the deduplication process, and as well the signature feature of a formal MDM system. In manual semantic deduplication we ask a human being to look at the records in question, at e.g. our three ADDRESS records in the above example, and decide which one is the “master” and which are just variations. This can be accomplished in the simplest terms possible, with a report on hard cases and a means for collecting feedback.
This report-and-feedback combination is, again, what distinguishes formal MDM systems from a pipeline between the RAW and ENTITY layers with hard-coded automated syntactic logic. The MDM system will formalize and automate as many of the straightforward and corner-case rules as feasible, and because we’re building a formal MDM system we can perhaps stuff in even more corner cases than would be acceptable in a pipeline. But we also include a provision to report on records that seem, for whatever reason, to be duplicates, and to collect (hopefully automated) feedback from the user on whether they are. This manual decision is recorded and the master record designated as such.
A formal MDM system then is more than just the various logical formulae that encapsulate deduplication cases. It’s also more than just a mechanism to collect manual feedback on the differentiation of hard cases. A formal MDM system should also record the decision and the input records, so that over time a stock of variations on the master record is collected and can be used for future matches. While the roots of an MDM system are in the identification of natural keys for the purposes of constructing ENTITY records out of raw records, the MDM system formalizes and automates the informal, ad-hoc and often manual processes used to build pipelines to populate those ENTITY tables.
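One way to picture that record-keeping is a pair of tables: a table of surviving “golden” records, and a cross-reference that maps every source variation to its master along with how, when and by whom the match was decided. The schema below is a hedged sketch under assumed names, not a prescribed design.

```sql
-- Golden records: one row per mastered customer.
CREATE TABLE MDM.MASTER_CUSTOMER (
    MASTER_CUSTOMER_ID  BIGINT PRIMARY KEY,
    LEGAL_NAME          VARCHAR(200),
    DOB                 DATE,
    NATURAL_KEY         VARCHAR(400)
);

-- Every source variation ever seen, mapped to its master, with the
-- provenance of the match decision retained for audit and for future
-- automated matching.
CREATE TABLE MDM.CUSTOMER_XREF (
    SOURCE_SYSTEM       VARCHAR(50),        -- e.g. 'ERP', 'CRM'
    SOURCE_CUSTOMER_ID  VARCHAR(100),
    MASTER_CUSTOMER_ID  BIGINT REFERENCES MDM.MASTER_CUSTOMER (MASTER_CUSTOMER_ID),
    MATCH_METHOD        VARCHAR(30),        -- 'STRAIGHTFORWARD', 'CORNER_CASE', 'MANUAL'
    DECIDED_BY          VARCHAR(100),       -- rule name or human reviewer
    DECIDED_AT          TIMESTAMP,
    PRIMARY KEY (SOURCE_SYSTEM, SOURCE_CUSTOMER_ID)
);
```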
One final note about the construction of a formalized MDM system. There’s value in using machine learning to generate automated corner-case deduplication rules, particularly in scenarios where there are lots of manual cases. When there’s a surplus of sources and application validation rules, it makes sense to have humans dedupe the hardest cases and then train models on those results to take as many of those cases as possible off their screens. In this way we can Pareto our way through the problem, continually narrowing the kinds of cases we expect a human to decide, so that humans make the genuinely hard decisions and a model makes the decisions humans have already made. We need to test these automated predictions regularly, and our ability to record the source of each decision becomes even more important when the time comes to roll back a batch of bad machine-made decisions. But there’s a lot of promise in machine learning for all but the most complex judgments of record mastering, and these kinds of automated probabilistic judgments are only getting cheaper. The diligent data management practitioner should ensure that any MDM system they build has a facility for training and automating corner-case deduplications, based on the human-judgment feedback collected as a matter of course.
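If decisions are recorded along the lines sketched above, assembling training data for such a model is mostly a matter of extracting the human-decided pairs as labels. A minimal sketch, assuming a hypothetical MDM.MANUAL_REVIEW table that records rejected as well as accepted matches:

```sql
-- Labeled examples for a match/no-match model: each row is a pair of source
-- records a human has ruled on, with IS_MATCH as the label. Feature
-- engineering (name distances, address similarity, etc.) happens downstream.
SELECT
    r.LEFT_SOURCE_SYSTEM,
    r.LEFT_SOURCE_CUSTOMER_ID,
    r.RIGHT_SOURCE_SYSTEM,
    r.RIGHT_SOURCE_CUSTOMER_ID,
    r.IS_MATCH,                      -- TRUE / FALSE, the human decision
    r.DECIDED_BY,                    -- retained so a bad batch can be rolled back
    r.DECIDED_AT
FROM MDM.MANUAL_REVIEW r
WHERE r.DECIDED_AT IS NOT NULL;      -- only pairs that have actually been ruled on
```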
What kinds of mastering do we do with these processes? Above we mentioned some of the entities we typically master: CUSTOMER, PRODUCT and ORDER are common. STORE, ACCOUNT, PROJECT and METRIC are also common, although less so. Financial considerations may dictate mastering of various financial concepts, including charts of accounts or payment accounts. In practice any time data for an entity is collected across multiple sources, there will eventually be a need for mastering. Again, often the only mastering required is the standard and some of the corner-case automated deduplication accomplished in the construction of an ENTITY layer, or when consolidating and reorganizing attributes from source records to the things we’re reporting on. But when those entities become business critical in high-profile ways, such as when we’re worried about customer experience with reported metrics or personal data or marketing emails, or when we’ve got cost considerations, such as charges to multiple versions of the same project, then we need a more formal MDM system. We need to formalize the deduplication rules and the manual feedback processes, and record metadata about each decision.
Each of these MDM-eligible entities has two ways it might need to be mastered, as we mentioned early on in this section. In practice these two kinds of mastering don’t share deduplication rules, and so two separate MDM systems will need to be built to facilitate both kinds of mastering for the same entity.
We’ve already used one example of the way records might need to be mastered: Two distinct records about the same individual thing, with slightly different values for the attributes. Our ADDRESS example above illustrates what we’ll call “individual entity” mastering: a single location with three separate descriptions, where we need to define a canonical representation and assign the other two records as subordinates. Individual mastering is what most people think of when they think of a mastering problem: There’s more than one definition of a reported metric, for example, or more than one representation of a single customer or store. These cases often arise simply because there are multiple source-system development teams, and so the source systems for the metrics, customers or stores leave us with varying ways of representing the same data. In such cases we’re looking at simple natural-key deduplication, even in the complex cases, and even in the cases that require human intervention and feedback. We look at the metric definition, or the range of store or customer attributes available to us, and decide that these two records are the same and this one is not.
The second kind of mastering is hierarchical. Consider that stores are often part of sales regions or brands, managed by hierarchies of store managers who report up through regional directors and thence to various regional VPs. As store hierarchies are recorded in different systems they get out of synch, and so the hierarchies need to be mastered and a given store assigned to its correct hierarchy.
Another example: A given product identifier belongs to a specific brand which may belong to one subsidiary in one region of the country but another in another region, but the product key may be transferred across various parts of the hierarchy over time until all of the possible locations for that key are completely out of synch. Another: A given metric may occur on multiple dashboards in different report suites, all with various names or slight variations in filter or logic. An extremely common example: A product is manufactured, sold and bundled into a product hierarchy or bill of materials by one division or brand of the company, but sold and renamed and rebundled in catalogs offered by other divisions of the same company, so that ultimately for every product sold in a given division there are N possible names or keys and M possible bundles that product might be sold under, all in the same logical company.
In each case we’re not concerned so much with the individual attributes of the individual entity instance, which may in fact be created in one single location. We may create all of our products in a single PLM tool, or source our metrics with single pipelines and tables stored in rock-solid code repos. It’s the renaming and placement of that well-constructed and possibly even individually mastered entity record within a hierarchy that’s at issue. The store, in other words, has one and only one ADDRESS record, but who’s ultimately responsible for the store’s output is anyone’s guess.
Hierarchy mastering tends to happen at smaller scale than CUSTOMER or PRODUCT mastering. There may be thousands of STOREs whose hierarchies are a mess, but their location within the hierarchy will often be a one-time fix. The difference in scale doesn’t necessarily mean the project is simple, by the way, just that there’s a couple of orders of magnitude difference between, say, the 100,000 instances of mismatched hierarchies and the millions of customers embedded in those hierarchies. MDM projects that work on hierarchy normalization, particularly the place of a customer within a hierarchy - e.g. which part of a large company a given subdivision currently reports into - will require a cleanup effort, where the hierarchies are pruned and reset. As part of that reset effort the people doing the cleanup should ensure subscribers to the new source are continuously updated and tested for compliance. These additional compliance requirements should be spelled out at the beginning of the project, and they come down to these elements (a compliance-check sketch follows the list):
Each source for the new hierarchy master should be formally identified.
Subscribers to the new hierarchy master should be formally identified.
Output formats to various subscribers should be formalized as data contracts.
Compliance should be checked on some periodic basis, perhaps by testing subscriber outputs against those contracts or through some similar enforcement mechanism.
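As an illustration of the last element, a periodic compliance check can be as simple as a query that flags subscriber records whose hierarchy assignment has drifted from the master. The table and column names below are assumptions for the sketch.

```sql
-- Flag stores whose hierarchy assignment in a subscribing system (here, a
-- hypothetical Analytics-warehouse STORE dimension) no longer matches the
-- mastered hierarchy.
SELECT
    d.STORE_ID,
    d.REGION_ID AS SUBSCRIBER_REGION,
    m.REGION_ID AS MASTERED_REGION
FROM DW.DIM_STORE d
LEFT JOIN MDM.STORE_HIERARCHY m
       ON d.STORE_ID = m.STORE_ID
WHERE m.STORE_ID IS NULL              -- store missing from the master entirely
   OR d.REGION_ID <> m.REGION_ID;     -- or assigned to the wrong place
```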
The last element, ensuring compliance, is critical. Without it, another hierarchy mastering project is only a matter of time.
Both individual and hierarchical mastering MDM projects create new sources of truth. Perhaps it goes without saying - but we’re going to say it anyway - that when an MDM project for a given entity has been completed, all the systems in the ecosystem that need that entity should be considered downstream of the MDM system: dimensions in the Analytical warehouses, entity tables in the ODS, hubs in the Data Vaults, lookups in source systems. Once you’ve implemented a customer MDM system, your CUSTOMER dimensions in your Analytical warehouses, CUSTOMER tables in your ODS and indeed even CUSTOMER tables in source systems should be sending new customer records to the MDM system, receiving mastered records from it, or both. The point of the MDM system is to replace or at least solidify the ENTITY layer in any given system, to ensure that pipelines transferring data from RAW to the ENTITY layer are making use of the right entities. Correctly adopting an MDM system for a portion of an ENTITY layer also means adapting your metadata schema to identify which records have been mastered, which are in the process of being mastered and which haven’t been mastered yet.
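In practice that metadata can be as small as a status flag derived from the MDM cross-reference. A hedged sketch, reusing the assumed MDM.CUSTOMER_XREF table from earlier (a “pending” status would come from a review-queue table, omitted here):

```sql
-- Classify each ENTITY-layer customer by mastering status, using the
-- cross-reference table as the source of truth for mastered records.
SELECT
    c.SOURCE_SYSTEM,
    c.SOURCE_CUSTOMER_ID,
    CASE
        WHEN x.MASTER_CUSTOMER_ID IS NOT NULL THEN 'MASTERED'
        ELSE 'UNMASTERED'
    END AS MASTERING_STATUS
FROM ENTITY.CUSTOMER c
LEFT JOIN MDM.CUSTOMER_XREF x
       ON  x.SOURCE_SYSTEM      = c.SOURCE_SYSTEM
       AND x.SOURCE_CUSTOMER_ID = c.SOURCE_CUSTOMER_ID;
```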
This integration calls for an ecosystem-wide recognition of everywhere the mastered domain appears. It would of course be unusual for a formal MDM system to start in an ecosystem that didn’t have some form of governance and architectural integrity, just as a matter of maturity. MDM systems arise of necessity because an ecosystem is complex, and someone in authority decides that leaving ENTITY mastering for e.g. CUSTOMER to the whims of a single data management practitioner building pipelines on a random Tuesday won’t cut it. At that point, someone has probably tried to quantify the consequences of bad customer data or irreconcilable product hierarchies. Perhaps there’s been a flurry of embarrassingly bad customer experiences, a lot of financial reporting problems or some high-profile conflicts over reported metrics. But the beginning of the MDM project requires identifying sources, and planning for the re-integration of the mastered data back into those sources.
There’s a temptation to think that you can build a big loop of mastered data - that all you really need to do is copy mastered data from the MDM system directly into the sources and good data will then flow into the rest of the ecosystem. Loops are rarely this tightly closed, and it makes good defensive sense to continually test that your Analytics warehouses, ODS systems, Data Vault hubs and source systems subscribe to your MDM systems. Certainly source systems, like your CRM, ERP and other application databases, need the latest mastered data, to ensure customers, for example, aren’t looking at multiple copies of themselves. But the ODS should test its mastered domains as well, as should dimension tables in the Analytics warehouses. It makes sense to shorten the loops here, and make each system both a contributor and a subscriber to the MDM systems. This tightened network of links between the locations of mastered data and the source of truth for mastered data provides a quality check on the ecosystem as a whole.
We’ve walked through how to think through an MDM system in this section. We’ve discussed how the MDM system evolves from basic good layer practices into a formal effort, including the various cases an MDM system solves that aren’t solved in basic ENTITY layer management. We discussed the difference between individual and hierarchical mastering, and a little about how MDMs should be integrated with the rest of the ecosystem. The author began his career in data management doing multiple kinds of MDM, and is astonished at the mystique that’s grown up around a process that used to be done at scale in spreadsheets. We’ve certainly come a long way. Mastered data makes the difference in customer experience both outside the organization, when people see you’ve organized your own diverse ways of collecting data about them, and inside the company, with executives and coworkers who don’t have to deal with multiple representations. MDM is well worth the effort, and simpler than it looks. A solid MDM strategy belongs in every data management ecosystem.