There are a couple of simple patterns you should consider when organizing your data management efforts for maximum impact. In this section we’ll discuss those patterns, look at a couple of different organizational models, and point to some fundamental properties of those models that help make teams effective.
Earlier on we mentioned that many enterprise architects have come around to the view that the enterprise is best viewed as an information supply chain. Usually the inspiration is a diagram like the one to the left. At one end of the enterprise are the suppliers of information, at the other end are the consumers, and in the middle are a bunch of nodes and edges that convey the flow of information from the organization’s inputs to its outputs.
Architecturally, the straightforward gloss on an “information supply chain” assigns applications the role of sources or “suppliers of data,” and internal or external customers the role of consumers or “users of data.” The intermediate edges and nodes in the enterprise are then assumed to perform some kind of value-add. In this concrete supply chain model, an example of which is to the left, the enterprise information supply chain is assumed to be the kind of thing you find when you’re making consumer goods: suppliers provide raw materials, factories assemble finished goods out of those raw materials, and warehouses store the finished goods for distribution to outlets, where the goods are purchased by consumers. These concrete models look a bit like how a data management team assembles reports, but they don’t really match our understanding of data flows within an organization.
Supply chain models are typically reduced to four tiers. In a four-tiered model, manufacturers are responsible for finding and assembling raw materials into finished goods. Distributors transfer those finished goods to operators, who sell the product to consumers. Each tier plays a specific role, adding discrete kinds of value at each step between, say, the mining of raw iron and the consumer who picks up the finished good. Distributors may assemble or bundle manufactured product into something more useful for the operator, for example, and operators may do a lot of last-mile assembly or marketing to identify the appropriate consumer. This four-tiered model is how the author first came to understand supply chains, and it’s replicated across nearly every industry, from automobile manufacturing to foodservice.
There are some obvious parallels between the information supply chain in the enterprise and this four-tiered model. As we suggested above, applications collect data from users, and that data is the raw material assembled into reports of various kinds using one of the systems in our data management ecosystem. In this sense, we could view our data management ecosystem diagram as something of a supply chain model, with value added from one end of the ecosystem to the other, where the arrows signify the “value-add” operations we see in a four-tiered institutional supply chain. When enterprise architects try to map a four-tier concrete supply chain model onto an information supply chain, however, they often end up with some questions of correspondence. Do we assign our pipeline management system the role of distributor, even though it doesn’t appear on our ecosystem diagram, or maybe our visualization tool, which also isn’t there? Or perhaps the analytical warehouse itself is the distributor, because it acts as an aggregation point for a bunch of source systems that supply raw data? Or perhaps the visualization tools are the operator?
With these distinctly un-clarifying questions out in the open, let’s return to our simple model of information flow within an enterprise. Sure enough, in our graph model there are points that correspond to the manufacturers in a four-tier model. Every enterprise has a number of applications that create data at the Input points. Some of those sources collect data from outside the enterprise, from customers, and some from inside the enterprise, from employees. And there are definitely consumers, on the right-hand side of the diagram, at the Output points. And in between, we see a number of nodes where data is used and edges where data is transferred between users, which more or less look like additional stages in a concrete supply chain.
The primary difference between the simple four-tiered concrete model and an actual information supply chain in the wild is the complexity of use. Supplement our simple model of how data flows through the enterprise with some detail about who adds value where, and we see something like the diagram to the left. Within the enterprise, information flows through specific teams, or what we’ll call Areas of Responsibility (AORs). Examples of AORs include various subteams in Marketing, Finance, Order Management or Supply Chain. Each of those AORs combines data from multiple sources and makes use of it, usually creating a brand new set of insights that then gets passed on further to the right of the diagram. Each team synthesizes data to manufacture insights for downstream teams, or at the very least synthesizes the raw materials it receives from upstream nodes into something used inside its AOR. For example, Order Management uses raw order data to make decisions that are then passed downstream, to (at the very least) Finance and Supply Chain. When an order is managed by Order Management, someone or something is flipping order statuses, adjusting delivery times or addresses, or shifting inventory or product. This might be done manually in the case of large finished goods orders going to gigantic customers, or it might be done automatically for most orders at a lean ecommerce startup. But within the Order Management AOR, raw data from various sources is changed into something new, and this new product is used downstream by other AORs. Similarly, Finance combines data from a broad set of teams and makes something new, as do Marketing and Supply Chain.
This is the primary difference between an information supply chain and a four-tiered supply chain: Each team functions as its own manufacturer, distributor, operator and consumer, within their own AOR. Indeed, each team may have its very own data management ecosystem:
Their own source applications (e.g. Adobe Analytics, or the ERP order management module, or a CRM)
A set of analytical warehouses designed to analyze facts across time
A set of operational data stores designed to facilitate current decisions
Locally-scoped MDM systems mastering campaign or product data
Their own pipeline management system, data quality framework and visualization tools
And so on, all within the scope of the team’s AOR. Each team within the information supply chain necessarily creates its own data management ecosystem, and thus each team is its own four-tiered supply chain, with downstream teams treating the data they receive from upstream as more-or-less raw material for their own manufacturing operation.
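To make that concrete, here is a minimal sketch in Python of what one AOR’s ecosystem inventory might look like; the field names and the example systems in the Marketing instance are illustrative assumptions, not a prescribed stack.

```python
from dataclasses import dataclass, field

@dataclass
class AORDataEcosystem:
    """Illustrative inventory of one team's data management ecosystem."""
    name: str
    source_applications: list[str] = field(default_factory=list)    # e.g. Adobe Analytics, ERP modules, a CRM
    analytical_warehouses: list[str] = field(default_factory=list)  # facts analyzed across time
    operational_data_stores: list[str] = field(default_factory=list)  # current-state decision support
    mdm_domains: list[str] = field(default_factory=list)            # locally mastered entities, e.g. CAMPAIGN
    pipeline_tooling: list[str] = field(default_factory=list)       # movement, quality, visualization

# A hypothetical Marketing AOR, assembled for illustration only.
marketing = AORDataEcosystem(
    name="Marketing",
    source_applications=["Adobe Analytics", "CRM"],
    analytical_warehouses=["engagement_warehouse"],
    operational_data_stores=["campaign_ods"],
    mdm_domains=["CAMPAIGN"],
    pipeline_tooling=["orchestrator", "dq_framework", "viz_tool"],
)
```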
The reality of this complexity comes home when we step back architecturally to consider how we might optimize these teams’ use of data. Certainly we can identify scenarios where we’d want clean and mastered CUSTOMER or PRODUCT data across each AOR. But excepting consistent customer, order or product data (or whatever other entities might be core to the business), for the sake of efficiency we simply wouldn’t want to force individual ecosystems into a global four-tier model. An analytical warehouse or ODS that works for Finance wouldn’t contain much of the data necessary for Marketing or Supply Chain, and Finance wouldn’t be content with the slices of the picture needed by Supply Chain. Nobody but Marketing may need to master CAMPAIGN, for example. Marketing may share the use of mastered CUSTOMER, PRODUCT and ORDER data, but those aren’t the only MDM domains the Marketing AOR needs, even if nobody else cares whether CAMPAIGN is clean. For Marketing to perform optimally, to get the data they need to succeed, we want to avoid artificially restricting their ability to manufacture data products to just the data Finance approved and scoped, and vice-versa. Another example: Marketing needs a robust analytical warehouse aggregating customer interaction data, from website analytics to customer service calls (collected in a so-called Engagement model), to analyze the success of marketing efforts. No one else in the organization needs that kind of analysis, and colleagues will indeed tune out their Marketing counterparts when the subject comes up over lunch in the cafeteria, or even worse during budget discussions. If we viewed Finance or Order Management as the manufacturers of data, and Marketers as just consumers of Finance data products, we’d quickly discover Marketing was unable to operate.
That kind of false optimization, by the way, is one of the causes of the perception that a data warehouse project has failed. The scenario typically follows something like this track: Finance pays for an analytical data warehouse project on the premise that everyone will be able to use it, but the Finance people have no interest in the CAMPAIGN attributes Marketing needs to operate with, or the PRODUCT details Supply Chain needs, or the CUSTOMER data the Order Management team needs. As a result the Finance-oriented warehouse struggles to find uses outside the handful of people in the Finance team who need to forecast monthly revenue at the highest level. The organization as a whole inevitably takes a dim view of “data warehouse” projects simply because they’ve been scoped inappropriately. In such cases, the enterprise has tried to organize the nodes and edges of their information flow so they follow something like the four-tier Manufacturer-Distributor-Operator-Consumer model, by assuming the needs of consumers are the same across the entire enterprise, and that there’s a single manufacturer.
One clear pattern of organization, then, is that each team must have its own supply chain. Teams will get data from upstream, which we can stipulate is raw or more precisely raw for them. They make use of that raw material in their own idiosyncratic ways. We can’t force everyone into a single concrete four-tiered model and hope to succeed. We must allow each team to scope out their own model, and add value to their raw materials as they see fit.
There are opportunities to optimize an enterprise’s entire information supply chain, in at least two dimensions. First, we should provide a base of tools: a cloud database solution that each team can use freely, a common set of pipeline management tools, a consistent data quality framework, a common notification platform and data catalog, and a single visualization tool. We see the benefits of scale from providing consistent tools across each AOR. But within the AOR our expectations are agnostic about how and what each set of tools will be used for. We expect pipelines will be created using a consistent standard, for example, so that data can flow from one AOR to another with a minimum of mismatch. But the content of the pipelines within an AOR is up to that team.
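As a rough illustration of what a “consistent standard” could mean in practice, here is a hedged sketch of a tool-agnostic pipeline specification that every AOR might fill in the same way; the field names and the validation rule are assumptions for illustration, not a reference to any particular orchestration product.

```python
# A hypothetical, tool-agnostic pipeline specification. Every AOR describes its
# pipelines with the same fields, so handoffs between AORs share a common shape;
# what each pipeline actually moves remains the AOR's own business.
pipeline_spec = {
    "owner_aor": "Order Management",
    "source": {"system": "erp_orders", "object": "order_header"},
    "target": {"system": "enterprise_warehouse", "schema": "order_mgmt", "table": "orders_raw"},
    "schedule": "hourly",
    "quality_checks": ["row_count_match", "not_null:order_id"],
    "contact": "order-mgmt-data@example.com",
}

def validate_spec(spec: dict) -> list[str]:
    """Return the required fields missing from a spec (the standard's minimum contract)."""
    required = ["owner_aor", "source", "target", "schedule", "quality_checks", "contact"]
    return [key for key in required if key not in spec]

assert validate_spec(pipeline_spec) == []  # this spec meets the shared standard
```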
Our second opportunity requires a little more explanation. To properly use data within each AOR or each team within the enterprise, two specific functions need to be accounted for:
Movement
Integration
The “Movement” function is represented by the edges in the AOR’s information flow, and takes the form of copying records from source through to target. The “Integration” function integrates records from various sources into the nodes that allow people within those teams to make decisions. Integration may happen in an Excel spreadsheet and movement might be as simple as a manual copy-and-paste from two other spreadsheets. Or the two functions might be performed by complex replication tools copying data into a comprehensive enterprise data model that supports a robust suite of Tableau reports. But both functions need to be performed for that AOR to use its data correctly.
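A minimal sketch of the two functions, assuming a couple of hypothetical order and inventory extracts: movement copies records faithfully from source to target, and integration combines them into something a decision-maker can actually read.

```python
import pandas as pd  # assumed available; an Excel workbook plays the same role in many AORs

def move(records: list[dict]) -> list[dict]:
    """Movement: the edge in the information flow; copy records from source to target, unchanged."""
    return [dict(r) for r in records]

def integrate(orders: list[dict], inventory: list[dict]) -> pd.DataFrame:
    """Integration: combine records from several sources into a node someone can decide from."""
    return pd.DataFrame(orders).merge(pd.DataFrame(inventory), on="product_id", how="left")

# Hypothetical extracts standing in for two source systems.
orders = move([{"order_id": 1, "product_id": "A", "qty": 5}])
inventory = move([{"product_id": "A", "on_hand": 2}])
decision_table = integrate(orders, inventory)  # what an analyst in the AOR actually reads
```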
One common organizational pattern within an enterprise is to split these two functions into two separate teams. Often a given AOR, such as Finance or Marketing, will have an “Analytics” team that does the integration work, or the work of “analyzing” source data into something that can be consumed by end-users within the AOR. (Remember that what’s a source for an AOR may be an output of another AOR, so when we say someone is analyzing source data, “source” is contextual.) Analytics teams look at the records pasted into e.g. an Excel spreadsheet, and construct some sort of proof to support a decision based on what they see there. The Order Management team, for example, may get reports via email that tell them which orders are on backorder so they can schedule communication tasks and take appropriate action. Before they can make those scheduling decisions, someone has to construct a report that contains the relevant datapoints required to make those decisions. The Analytics function within the Order Management team does that integration work, looking at the raw data and arranging or remixing it to ensure their coworkers can make good decisions. The actual “analytics function” on the Order Management team might just be Order Management people specifying what they need on their report to be able to make a decision. Other teams may have people dedicated to performing those analytics functions, but more often than not, especially in data-lean organizations (which is most of them), the integration work is done by the people who need to use the data.
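To make the backorder example concrete, here is a small hedged sketch of that integration work; the field names and the decision rule (demand exceeds on-hand inventory) are assumptions standing in for whatever the Order Management people actually specify.

```python
import pandas as pd  # assumed; in practice this is often just an Excel workbook

# An integrated table such as the one produced above, with the fields the
# Order Management people said they need in order to make the decision.
orders = pd.DataFrame([
    {"order_id": 1, "customer": "Acme Corp", "product_id": "A", "qty": 5, "on_hand": 2},
    {"order_id": 2, "customer": "Rex Ltd",   "product_id": "B", "qty": 1, "on_hand": 9},
])

# The decision rule the AOR specified: an order is on backorder when demand exceeds inventory.
backorders = orders[orders["qty"] > orders["on_hand"]]
print(backorders[["order_id", "customer", "product_id"]])  # roughly, the emailed report
```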
Necessarily there is also some other team, a Movement team, which makes that raw data available in something like an automated fashion to the Integration function in the AOR. If, for example, the Order Management team’s analytics people need additional attributes in their report, they may need to go to a data engineering team to ask for those attributes to be made available.
There are lots of variations here. For smaller teams, one person will often do both the movement and integration for a given AOR, functioning as both the data analyst and the data engineer. That person will extract data from an AOR’s sources using some semi-manual tool and integrate the results into a report usable by their coworkers. Sometimes that one person is even an intern, hired to squeeze data out of the ecommerce engine using a desktop database tool, with the end goal of building a series of, say, backorder reports.
But it’s common in many enterprises for a central data engineering team to do all the movement work at the behest of individual integration functions within an AOR. Often an organization will concentrate all the data engineering in a single team and centralize all the movement work, and not have any dedicated integration specialists, leaving the integration or “analysis” work up to line-level people in each AOR. Movement operations are centralized on the premise that, first, complex and expensive data movement toolsets are best managed by centralized experts with centralized cost control. This usually makes sense, as data engineering toolsets can be both expensive and hard to use, and so specialists in Python or Informatica or even just whatever cloud-compute-intensive tool is required are best tagged for the work. Second, there’s an assumption that the centralized data engineering team will be best capable of managing centralized business logic, a sort of implicit reversion to a four-tiered supply chain model. In such cases individual AORs pass their integration requirements - “show this product data pivoted against this order data for these customer types,” for example - to the centralized movement team, which builds just the node they need to make a decision. The centralized movement team then theoretically applies centralized business rules to the request: these teams can have this data or that, these customer attributes or those campaign attributes, and so on. Now, this centralization is often enforced under the cover of privacy or data security concerns. When there is a concern for centralized business logic, there’s often a prejudice for a particular group’s conception of an entity, and so IT, Finance or Marketing will end up tilting the business logic in their direction for every other AOR just because that’s who’s paying the bills.
A third reason for centralizing data engineering is the perception that integration is sort of last-mile work, the least complex or interesting problem in the ecosystem. Ironically this prejudice often ends up wrecking the goal of centralized business logic, even when that goal is a stated intention of the corporate data engineering group. In the author’s experience this view is prevalent when centralized data engineering is the norm in an enterprise. The result is usually that each reporting solution depends on an individual pipeline, each with bespoke and inevitably varying business logic. These single-use solutions are usually brittle and idiosyncratic. When common terms, entities or deduplication criteria are required to provide good data to end-users, as they are whenever entity data is recorded in more than one source, these one-off solutions from centralized data engineering teams often end up producing no more than accidental coordination.
The prevalent pattern that emerges, then, with centralized data engineering or movement functions and decentralized analysis or integration functions is one where solutions are narrowly optimized for specific end-users. Rarely will a centralized data engineering team dedicate time to the development of an AOR-specific data model or even a common data model, perceiving that work as some combination of “out of scope,” “too hard” and “unnecessary.” Each report relies on its own dedicated pipeline, supported by a single data engineer, with its own specific business logic and its own specific quality and presentation standards.
A centralized movement team can be really valuable if it’s coupled with a centralized integration team. There is tremendous value in a centralized group of data modelers who can aggregate logic for CUSTOMER, PRODUCT and ORDER across each AOR, and also contribute to internally-consistent views of AOR-specific data such as CAMPAIGN, STORE or INVENTORY (for example). But these teams are increasingly rare. Instead we have standalone pipelines, where the consistency of business logic for e.g. CUSTOMER deduplication or attribute management is up to the diligence of the individual pipeline developer and their informants in the integration function of the AOR. In other words, if the analytics team, or more often the business team, isn’t aware it might be getting a view of its core entities that isn’t consistent with other AORs, it has no way of preventing conflicts from developing.
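As a hedged sketch of what that centralized integration function buys, imagine a single deduplication rule for CUSTOMER owned by the modeling team and imported by every AOR’s pipeline; the matching criterion shown is an assumption for illustration only.

```python
# One shared rule, owned by the centralized integration (data modeling) team.
# Every AOR pipeline imports these functions instead of re-deciding the logic,
# so CUSTOMER counts agree from Finance to Marketing to Supply Chain.

def customer_key(record: dict) -> tuple[str, str]:
    """Assumed matching criterion: normalized email plus postal code."""
    return (record.get("email", "").strip().lower(), record.get("postal_code", "").strip())

def dedupe_customers(records: list[dict]) -> list[dict]:
    """Keep one record per customer key; survivorship here is simply 'first seen'."""
    seen: dict[tuple[str, str], dict] = {}
    for rec in records:
        seen.setdefault(customer_key(rec), rec)
    return list(seen.values())

# Two AOR pipelines calling the same function get the same CUSTOMER population, e.g.:
#   finance_customers = dedupe_customers(erp_extract + crm_extract)
#   marketing_customers = dedupe_customers(web_extract + crm_extract)
```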
Again, this pattern of single-use brittle reports spec’d by technically unsophisticated business users and supported by standalone pipelines maintained by a single data engineer is so common it’s practically an IT standard. Nobody does data modeling, or if they do it’s only within the context of a single report or pipeline. Central data engineering organizations are loath to build centralized repositories with consistent data models. The net result is as you’d expect: different CUSTOMER, ORDER and PRODUCT counts across the various AORs, different naming conventions for attributes, variations in quality and latency, and, despite a consistent and centralized backlog for making requests and managing tool costs, no consistent and centralized set of concepts where it counts, in the core business entities.
Earlier we made much of the fact that data is passed through one AOR to another, unlike a four-tiered supply chain model with a centralized manufacturing function. How do we square this view of the information supply chain, which we seem to favor, with the need for a centralized concept of CUSTOMER, ORDER and PRODUCT? The answer is our preferred organizational model for data management, which centralizes some of the movement and integration functions and decentralizes others.
An organization should optimize its tool cost by developing or purchasing consistent tools for data engineering, visualization, SQL compute, data quality, and documentation and discovery. There should be a single set of tools, available to every AOR, with a single set of standards, ideally accessible to beginner-level users across the enterprise. Whether the users are in Marketing, Finance or Supply Chain, they need access to an off-the-shelf set of tools that can be used to move, integrate and use data from their sources. This is actually the simplest problem to solve, even though it’s often the one enterprises stop at. In addition to centralizing movement tools, the organization needs to consider centralized integration of CUSTOMER, PRODUCT and ORDER to be as critical to the use of data as access to orchestration platforms, replication engines or visualization tools. Users in each AOR need the ability to find a pool of CUSTOMER data they can use for their own reporting, in addition to readily available licenses for e.g. Tableau or Nexla. Thus in addition to the centralized tools and semantic concepts, an enterprise needs people who can perform complex engineering and semantic-layer development tasks.
This view presupposes at minimum two layers of complexity in the enterprise. For the simplest problems, each AOR should have the movement and/or integration skills it needs to serve itself. Standardized tools and governed data should be made freely available to promote basic consistency in quality across the individual AORs. For more complex tasks - e.g. the building of an MDM within an AOR, perhaps, or other complex logic or engineering tasks beyond the skills of individual AORs - there needs to be a centralized team of specialists who can both pitch in with each AOR’s projects and maintain the centralized tools and semantic layer necessary for consistency across AORs. These two layers imply a centralized organization with data engineering and data modeling teams supporting a superset integration of key or foundational enterprise entities such as CUSTOMER, PRODUCT or ORDER, using data extracted from each AOR’s source systems and made available for use within each AOR. That centralized data management team is then also available to contribute to MDM, analytical warehouse, ODS or data vault development within an AOR, as needed.
Our expectation is that an AOR will prioritize its own “analytical” or integration needs first, often within a perceived narrow scope. This organizational model - centralized data engineering and modeling supporting both centralized semantic layer development and AOR-specific semantic development as needed - allows the enterprise to optimize its tools and semantics. AORs will need to be experts in their own semantics, and ideally develop or at least participate in the development of their own semantic layers. But those AORs can’t be expected to optimize globally, to have access to the bigger picture on CUSTOMER, PRODUCT or ORDER semantics or to understand the global use-cases for replication engines or visualization tools. Structuring data management this way ensures a centralized team has the ability to optimize both for tool cost and semantic consistency, while supporting quality in individual AORs. And individual AORs can focus on optimizing their local situations, without having to spend money on trying to solve problems that are best solved at the global level.
We’ve discussed two patterns of organization common to enterprise data management in this section. First, data flows through an information supply chain from source systems to target users. Unlike a common four-tiered supply chain, however, the enterprise is made up of individual areas of responsibility that must each move and integrate their data to locally optimize for their own solutions. Second, organizations often try to optimize data movement centrally, leading to disorganized business logic or badly-optimized centralization. A data management team that can centralize a superset of key enterprise entities such as CUSTOMER, ORDER and PRODUCT while also providing both tools and technical support to individual areas of responsibility allows an enterprise to optimize its information supply chain.