Monday, February 26, 2024

2.1 There are a lot of sources

One reason data management is hard is the sheer diversity of sources data management practitioners are expected to rationalize. Consider a typical Customer-360 program. The goal of a Customer-360 program is to collect all of the data in a given enterprise about a customer, organize it into a matrix of some sort, and make it available for eventual consumers throughout the organization. Those sources often include:



  1. Customer relationship management (CRM) systems, which may record everything from orders and returns to call-center and customer-support interactions, as well as data about the customers themselves, such as addresses or eye-color. 

  2. Enterprise resource planning (ERP) systems, which may record everything from orders and returns to call-center and customer-support interactions, as well as data about the customers themselves, such as addresses or eye-color. (No, that duplication with CRMs is not an error. ERPs will contain other sorts of data, but their customer-specific data is often similar to what’s contained in a CRM system.) 

  3. Marketing systems, which will record attempts to communicate with customers via email, print or social media campaigns, and which often serve as an early-warning system in that they collect data about customer behavior on an organization’s platforms, such as websites.

  4. Legacy or in-house order management systems, which may exist independently of ERP systems.

  5. Backoffice reference data management systems, which are often built to supplement CRM systems. These applications may collect data about a user’s voluntary interactions with an organization’s products, such as use of forums or knowledge base data, or they may be designed to manage certain kinds of complaints. They may also record legal issues related to the customer’s use of a company’s products, such as copyright violations. 

  6. Existing data warehouse, operational data store, or reporting systems, which may have had data of various sorts manually added to them to assist in, e.g., financial reporting, returns management, or some such enterprise activity. 

  7. Standalone applications of various kinds integrated into the customer’s experience and designed, more or less, to expedite a customer’s workflow, such as project management software, smartphone apps, or similar types of products, each of which generates data for the customer’s use or for other applications in the workflow.

  8. Finally, tools that monitor server and infrastructure performance can provide a very useful understanding of a customer’s experience with an organization. They may also generate massive amounts of data.  
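
To make the “matrix of some sort” slightly more concrete, here is a minimal sketch, in Python, of what a consolidated Customer-360 record might look like once a few of these sources have been stitched together. The class, the field names, and the merge rule are hypothetical illustrations, not the schema of any particular product or program.

    from dataclasses import dataclass, field
    from typing import Optional

    # Hypothetical consolidated record; every field name here is illustrative,
    # not taken from any real CRM, ERP, or marketing product.
    @dataclass
    class Customer360Record:
        customer_id: str                                             # surviving key after matching and merging
        source_ids: dict[str, str] = field(default_factory=dict)    # e.g. {"crm": "C-123", "erp": "900017"}
        name: Optional[str] = None                                   # typically from the CRM
        mailing_address: Optional[str] = None                        # CRM or ERP, whichever is trusted
        open_orders: list[str] = field(default_factory=list)         # ERP or legacy order management
        support_tickets: list[str] = field(default_factory=list)     # back-office or support systems
        campaign_touches: list[str] = field(default_factory=list)    # marketing systems
        web_sessions_last_30d: int = 0                               # clickstream, heavily aggregated

    # A toy merge rule: later sources fill in gaps left by earlier ones.
    def merge(records: list[Customer360Record]) -> Customer360Record:
        merged = Customer360Record(customer_id=records[0].customer_id)
        for r in records:
            merged.source_ids.update(r.source_ids)
            merged.name = merged.name or r.name
            merged.mailing_address = merged.mailing_address or r.mailing_address
            merged.open_orders.extend(r.open_orders)
            merged.support_tickets.extend(r.support_tickets)
            merged.campaign_touches.extend(r.campaign_touches)
            merged.web_sessions_last_30d += r.web_sessions_last_30d
        return merged

Even this toy version exposes the real questions: which source wins when two disagree, and how the identifiers get matched across systems in the first place.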



Any given enterprise of a certain size and age will often have more than one of each source in use, typically distributed across regions or across the units acquired to compose the current entity. Each will have been customized for various reasons, generally with insufficient documentation, and each is run by a harried internal IT organization that manages largely by word-of-mouth and implicit knowledge. Each data model will be different at base, independent of any customizations. Customizations will be pervasive. 

As the data modeling team behind each source had different intentions or assumptions, the degree of complexity involved in extracting and integrating meaningful data from each source will vary widely and wildly. There may be language differences, both human-language differences across national boundaries (e.g. one ERP system in German for the European region and one in English for the North American region) and artificial-language differences, such as systems that can only be accessed with COBOL. Different brands of database will be used, each necessitating a specific approach to extraction. There will be rules about privacy that differ from one region to another. There will be issues of scale: applications that track a user’s movement through a website will create on the order of terabytes of data a day, while order management applications may generate only 50 new records a day. (They’re large orders, in that case.) And finally, each system, often controlled by an independent team, will approach security and access in its own way.  
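
One common way teams cope with that per-source variety, sketched here purely as an illustration, is to hide each source behind a uniform extraction interface; the class names, methods, and sample rows below are hypothetical.

    from abc import ABC, abstractmethod
    from typing import Iterator

    # Hypothetical uniform interface over very different sources; all names are illustrative.
    class SourceExtractor(ABC):
        @abstractmethod
        def extract_customers(self) -> Iterator[dict]:
            """Yield raw customer rows in whatever shape this source provides."""

    class CrmExtractor(SourceExtractor):
        """Might page through a vendor REST API."""
        def extract_customers(self) -> Iterator[dict]:
            yield {"crm_id": "C-123", "name": "Example GmbH"}      # stand-in for real API calls

    class MainframeExtractor(SourceExtractor):
        """Might parse fixed-width files produced by a nightly COBOL batch job."""
        def extract_customers(self) -> Iterator[dict]:
            yield {"cust_no": "900017", "name_de": "Beispiel AG"}  # stand-in for a file parse

    # Downstream code can then treat every source the same way.
    def load_all(extractors: list[SourceExtractor]) -> list[dict]:
        return [row for ex in extractors for row in ex.extract_customers()]

The interface hides the mechanics, but not the modeling differences: each extractor still yields rows shaped by its own system’s assumptions.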

So the data management practitioner has a lot of boxes and circles on their system diagrams, and within those system diagrams a lot of additional boxes and arrows describing how the data is actually arranged within each one. 

On top of that is a very straightforward cultural diversity issue. Application developers and managers are of necessity focused laser-like on their particular responsibility. They will know the precise ins-and-outs of the data collected or created by their application, and often nurse grudges or defensive postures related to decisions made about the scope or function of the application. The developers themselves will often have different perspectives on the actual functioning of the application compared to the business analysts or architects who may also be consulted. But it’s no stretch to say that each application is an ecosystem containing at the very least the following niches: project managers to manage scope, business analysts to manage users, developers to manage delivery of functionality, architects to manage integration with larger goals, and operational people who keep the thing running. (It’s often best for the data management people if that’s all one person, but that can be a mixed blessing.) Add to that list the Users, who always seem to find a way to vault the barriers of defensive code and create data that violates all known laws of rationality or expectation.

Into that ecosystem steps the data management practitioner, who needs to extract the data created and collected by a given application, copy it somewhere, and understand why it was collected or created and the larger role it plays in the business. There will generally be resistance from the responsible parties to the extraction process, because the extraction takes resources that will compromise the delicate balance of hardware needed to maintain reads and writes in a cost-effective way. There will be resistance to the idea of copying the application’s data into a general pool where it may be misunderstood or diminished in importance. There will be conflicts between niches or roles or just individuals within the ecosystem, conflicts that have been ongoing for decades and that the practitioner knows nothing about, but which appear to be about the data they need. And each application, being a system, is always evolving to meet the needs of the larger ecosystem it plays in. As a result it’s not the case, unfortunately, that if you’ve seen one SAP system you’ve seen them all. They’re all snowflakes, and each is an opportunity for each member of the team to resolve long-standing complaints and be creative with their day job. But that’s application developers for you: Each is a frustrated artist, looking to make their mark on the world.   

Data managers will have to identify, understand and integrate all of these sources. In principle they may have to do all of them; in practice they need to chart the ones they’ve identified on two axes, easy-to-hard and low-value-to-high-value, and work their way through the list. On top of managing the value provided by all these sources, though, they must also build their own system, often under significant time pressure from management teams eager to understand the signals they believe to be available in the environment. 
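
As a rough sketch of that triage, assuming the practitioner can put a subjective value and difficulty score on each identified source, “working their way through the list” often amounts to something like the following; the source names and scores are invented for illustration.

    # Hypothetical triage of identified sources on the easy/hard and
    # low-value/high-value axes; the names and scores are invented.
    sources = [
        {"name": "CRM (North America)",     "value": 9, "difficulty": 3},
        {"name": "ERP (Europe, German)",    "value": 8, "difficulty": 8},
        {"name": "Marketing platform",      "value": 6, "difficulty": 4},
        {"name": "Legacy order management", "value": 7, "difficulty": 9},
        {"name": "Web clickstream",         "value": 5, "difficulty": 7},
    ]

    # One simple heuristic: highest value for the least effort goes first.
    def priority(source: dict) -> float:
        return source["value"] / source["difficulty"]

    for s in sorted(sources, key=priority, reverse=True):
        print(f'{s["name"]}: value {s["value"]}, difficulty {s["difficulty"]}, score {priority(s):.2f}')

The scores themselves are guesses that get revised as each source is opened up, but the ordering exercise is what keeps the program from starting with the hardest, least valuable system.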

So data management is hard because there are a lot of sources. Even leaving aside bad data or unrealistic expectations, the sheer diversity of sources makes it no wonder that many of those aforementioned Customer-360 programs fail. It’s a wonder, given the challenges, that so many data management projects get off the ground at all.

Add to that several additional issues, which we mentioned at the top. There’s often an expectation that the data management system will play an operational role as a data source itself, perhaps as a mastering system serving as the “single source of truth” on a customer’s current position and likely direction. And very likely the system is not being built for the enjoyment of the builders, so management and operations people alike will expect to use the data collected by the system to make their own more-or-less real-time decisions.    

All of this, of course, is what leads people to believe that data management is hard. It isn’t necessarily any one particular thing related to data management that makes it hard, and it isn’t as if, in the midst of all of the difficult local circumstances, a data management practitioner’s coworkers won’t feel immense sympathy for them. But there often appears to be little incentive to get any of the environment right in a way that might make data management easier, whether it’s documenting application business logic or putting aside longstanding departmental rivalries to make system access easier. The result is often a kind of prisoner’s-dilemma problem: everyone recognizes that no one is cooperating with the data management team, but no one wants to be the first to cooperate, because making data management easier seems to require doing their own job in ways that might make life harder for their individual team. Data management is hard, then, because there are a lot of sources.

