Monday, March 18, 2024

2.2 Applications change over time

We mentioned above that applications evolve. That is, over time an application will change the way it collects data, the kind of data it collects, and the way it stores that data.

Applications stop evolving for a variety of reasons: key developers move on, the organizational context becomes fixed or mature, or there's a plan to replace the application with something new and shiny and no one wants to invest in changing the old one. Stable, happy applications that continue to generate data day in and day out are the best kind, like the proverbial server absent-mindedly shut up behind drywall that continues to faithfully serve out web pages without complaint. But this is unfortunately a rarity, and most applications continue to evolve.


But the happy path, as they say, is not always the common path. Some non-evolving applications fall into the category of "burning platform," which is as bad as it sounds. Imagine you're on an exploding offshore oil-drilling rig, dozens of miles from land in 40-foot seas, with the firefighting crew on fire and the people who know how to run the thing jumping into an ocean full of sharks and jellyfish, all while being strafed by merciless drone attack helicopters; that's what it's like working around a burning platform. While it's certainly possible to extract data from such a system, it's best to cut your losses: get what you can, build in some defensive structure so that as the platform starts to fail the data you've extracted doesn't taint your entire package of products, wall off that data so you can mark it as suspect, and above all make it clear to your management, and to everyone in the chain above you, that you're doing your best but it's a burning platform. In this you will usually have allies on the team managing the burning platform, and you may be able to extend and strengthen that alliance by using your data management products to demonstrate the value of replacing it. On rare occasions the team managing the burning platform is not aware it's on fire, maybe because they're delusional, because their jobs depend on not knowing, or because they're simply being paid to look the other way. In those situations it's best to document the difficulties and find ways to re-prioritize so that the inclusion of that data in your systems and products is deferred as long as possible.

But not every application is a burning platform, thank goodness. They will evolve, however. In general there are three dimensions along which they can evolve:

  1. Collecting more attributes about an existing entity.

  2. Collecting data about new entities.

  3. Collecting different data for existing attributes of existing entities.

In case (1), we're usually talking about straightforward expansions of application forms. You may start to collect EYE_COLOR for CUSTOMER, or additional CUSTOMER address fields. Sometimes these additional attributes only require that your extracts modify their queries to include the new fields; other times new queries or extract processes will need to be created to accommodate new table structures. In general, however, the target system will already contain the dimensional or subject-matter structures necessary to store the new attributes without structural changes, and if the system is abstract enough the change management process will consist of little more than adding new fields to existing tables.
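The mechanics of case (1) can be sketched with an in-memory SQLite database. The table, its columns, and the extract queries below are illustrative assumptions, not any particular system's schema:

```python
import sqlite3

# In-memory stand-in for the source application's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("INSERT INTO customer (name, address) VALUES ('Ada', '1 Main St')")

# The original extract query.
extract_v1 = "SELECT id, name, address FROM customer"

# The application evolves: a new attribute appears on an existing entity.
conn.execute("ALTER TABLE customer ADD COLUMN eye_color TEXT")
conn.execute("UPDATE customer SET eye_color = 'brown' WHERE name = 'Ada'")

# The extract only needs its column list widened; no new tables or joins.
extract_v2 = "SELECT id, name, address, eye_color FROM customer"

rows = conn.execute(extract_v2).fetchall()
print(rows)  # [(1, 'Ada', '1 Main St', 'brown')]
```

Note that `extract_v1` still runs unchanged after the evolution; the change-management work is deciding when to switch extracts, not restructuring anything.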

In case (2), the application creates brand new kinds of data: for example, it may start to collect PREFERENCE data for a CUSTOMER. Whereas before the application stored only core data about the CUSTOMER such as name, address, and serial number, now it begins collecting data about how they'd like their user interface arranged (or perhaps whether it's acceptable to communicate with them via email) in a PREFERENCE data structure. More than likely changes of this type will require both new extracts from the application into the target system and some new data modeling in both source and target. Generally, however, the relationships in the target system will remain intact, with the same cardinality. That is, one set of PREFERENCE attributes will correspond to a single CUSTOMER natural key, and so the target system's semantics can simply be "snowflaked" or modified in the usual way into a Type N+1 dimension.
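Case (2) can be sketched the same way. The PREFERENCE attributes and the natural-key naming below are hypothetical; the point is that the new entity hangs off the existing CUSTOMER key with 1:1 cardinality, leaving existing relationships untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (natural_key TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES ('C-001', 'Ada')")

# The application begins collecting a brand-new kind of data: one set of
# PREFERENCE attributes per CUSTOMER natural key.
conn.execute("""
    CREATE TABLE preference (
        customer_key TEXT PRIMARY KEY REFERENCES customer(natural_key),
        email_ok     INTEGER,
        ui_layout    TEXT
    )
""")
conn.execute("INSERT INTO preference VALUES ('C-001', 1, 'compact')")

# A new extract joins the new entity to the existing one; the CUSTOMER
# side of the model is unchanged.
rows = conn.execute("""
    SELECT c.natural_key, c.name, p.email_ok, p.ui_layout
    FROM customer c JOIN preference p ON p.customer_key = c.natural_key
""").fetchall()
print(rows)  # [('C-001', 'Ada', 1, 'compact')]
```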

And finally, in case (3), we may change the way we collect data, or the data we collect. In the US, for example, it's common to collect RACE as a field in the CUSTOMER entity. (This is not common elsewhere, and is part of the peculiar history of race in the US.) As recently as fifteen years ago RACE in the US was a single-valued field; now it's generally multi-valued, which is to say that an individual CUSTOMER may identify with more than one RACE value. An application that collects CUSTOMER.RACE will therefore need to change to accommodate the fact that, while only one value was necessary in the year 2000, the system must now store more than one. The source application may also need validation rules on that storage, because there may be limits on the number or content of the allowed combinations, and those rules may need to be reflected in the target system.
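One common way to accommodate a single-valued field going multi-valued is to move it into a bridge table, one row per CUSTOMER/RACE pair. The schema below is a hypothetical sketch of that pattern, not a prescription:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Before: RACE was a single-valued attribute on CUSTOMER.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, race TEXT)")
conn.executemany("INSERT INTO customer (name, race) VALUES (?, ?)",
                 [("Ada", "WHITE"), ("Grace", "BLACK")])

# After: the attribute becomes multi-valued, so it moves into a bridge
# table. Validation rules (say, a cap on how many values are allowed)
# would live here too.
conn.execute("CREATE TABLE customer_race (customer_id INTEGER, race TEXT, "
             "PRIMARY KEY (customer_id, race))")
conn.execute("INSERT INTO customer_race SELECT id, race FROM customer")   # migrate old values
conn.execute("INSERT INTO customer_race VALUES (2, 'NATIVE AMERICAN')")   # a second value

counts = conn.execute(
    "SELECT customer_id, COUNT(*) FROM customer_race "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(counts)  # [(1, 1), (2, 2)] -- customer 2 now carries two RACE values
```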

In each of these three cases, the application’s evolution may proceed according to a tightly-managed metrics-driven scope and timeline, or it may happen in the dead of night with no one watching. Sometimes the same organization will take both approaches.

But why, if the source system is always changing, do we want to try to report on the data within it at all? Changing source data leads to the Discontinuity Problem, which arises when the trend we previously tracked can no longer be reported, because the underlying data has changed so fundamentally that the reporting is no longer internally consistent.

Consider a simple case: an organization used to report on RACE assuming every CUSTOMER had one and only one value. Every CUSTOMER was either WHITE, BLACK, ASIAN, HISPANIC, or NATIVE AMERICAN. That was it; the notion of possible racial identity was restricted to those five categories, allowing for the possibility that some CUSTOMER records had a NULL RACE value, perhaps because of a validation failure. At some point the list likely expanded to include PACIFIC ISLANDER, so a number of people voluntarily changed their RACE values and the target system needed to accommodate a dimension-type change. But because the categories were mutually exclusive, any aggregation of CUSTOMER on RACE would still sum to the total number of CUSTOMERs. After the move to a modern, multi-valued RACE field, however, a person can be in one or more categories. We personally know many people who would pick multiple categories, perhaps because their mother is WHITE and their father is ASIAN, or their mother is BLACK and their father identifies as NATIVE AMERICAN and WHITE. In the latter case a CUSTOMER may pick all three options.
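The arithmetic behind the broken invariant is easy to demonstrate with made-up records. Under mutual exclusivity, category counts sum to the CUSTOMER total; with multi-valued fields they overshoot it:

```python
from collections import Counter

# Single-valued era: RACE categories are mutually exclusive, so the
# per-category counts sum exactly to the number of CUSTOMERs.
single = {"c1": "WHITE", "c2": "HISPANIC", "c3": "BLACK", "c4": "WHITE"}
by_race = Counter(single.values())
assert sum(by_race.values()) == len(single)  # 4 == 4, the old invariant

# Multi-valued era: each CUSTOMER carries a *set* of RACE values.
multi = {"c1": {"WHITE"},
         "c2": {"HISPANIC"},
         "c3": {"BLACK", "NATIVE AMERICAN", "WHITE"},
         "c4": {"WHITE"}}
by_race_multi = Counter(r for races in multi.values() for r in races)

# Category counts (6) now exceed the CUSTOMER total (4): summing across
# categories double-counts customer c3 twice over.
overcount = sum(by_race_multi.values()) - len(multi)
print(overcount)  # 2
```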

But now the previously nice, tidy RACE reporting is all out of whack. Before, when tracking the success of a marketing program with respect to specific RACE values, the Reporting team might have noticed that while lots of CUSTOMER.RACE values were WHITE, and that number was growing handily month-over-month, the same could not be said where CUSTOMER.RACE = HISPANIC, which argued for spending more money to attract customers who identified as HISPANIC. When multi-valued RACE fields are allowed, though, the categories are no longer mutually exclusive, and the analysis has to determine whether the growth in any given category comes from people who have expanded their applicable categories, from brand-new CUSTOMERs, or from people who've switched.

The biggest concern on the Reporting team is ensuring they don't double-count people, but they've also quite likely thrown Marketing for a loop. Whereas before, campaign managers may have had their job performance evaluated on the growth of CUSTOMERs in specific RACE categories, the Reporting team can't measure the effectiveness of a campaign the way they used to. It may even be that they can no longer reproduce the old results, which is to say that the data the team used to have about who was doing better at attracting NATIVE AMERICAN customers is no longer available, because many of the CUSTOMER.RACE records that used to say NATIVE AMERICAN now also include another RACE value. If there's no record of which value was assigned first, it will likely be impossible to determine which assignment predated the switch to a multi-valued field. (That "impossibility" may seem an unnecessary and totally ad-hoc addition to the simple case, but in practical terms many applications aren't initially designed to timestamp changes, at least not until someone in Data Management complains.)
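The timestamping fix is cheap if it's built in from the start. The sketch below shows one hypothetical pattern (a history table written alongside every assignment); the table and function names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_race (customer_id INTEGER, race TEXT)")
# History table: every assignment is recorded with a timestamp, so the
# question "which value came first?" stays answerable after the fact.
conn.execute("""
    CREATE TABLE customer_race_history (
        customer_id INTEGER,
        race        TEXT,
        recorded_at TEXT DEFAULT (datetime('now'))
    )
""")

def assign_race(customer_id, race):
    conn.execute("INSERT INTO customer_race VALUES (?, ?)", (customer_id, race))
    conn.execute("INSERT INTO customer_race_history (customer_id, race) "
                 "VALUES (?, ?)", (customer_id, race))

assign_race(1, "NATIVE AMERICAN")   # the original single value
assign_race(1, "WHITE")             # added after the multi-valued change

# Earliest assignment wins; rowid breaks ties within the same second.
first = conn.execute(
    "SELECT race FROM customer_race_history WHERE customer_id = 1 "
    "ORDER BY recorded_at, rowid LIMIT 1"
).fetchone()[0]
print(first)  # NATIVE AMERICAN
```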

So our very simple case has made it very difficult for the Reporting team’s friends in Marketing to get their end-of-year bonuses, and when they see each other in the cafeteria no one is particularly happy. The data management team could use the term “discontinuity” to describe the problem, explaining that the trend line was swallowed by a sinkhole they had no control over, but don’t expect sympathy. It’s not their bonus, after all, that was swallowed by the sinkhole.

Preventing the Discontinuity Problem is easy: Don’t take data into your system when you know it’s about to change, or could change.

But as every system evolves, that would mean not taking any data. For people disinclined to do data management, this is a powerful reason for avoiding the headache.
