
4.2 The Data Lake

In the last five years or so, Data Lake projects have been sold as a game-changing new species in the data management ecosystem, a system that is easy and fast to build while providing a lot of short- and long-term value. The Data Lake is not new, however, and solutions of this type have a long and checkered history in data management. They're also the best way for a beginner Data Management practitioner to start building systems: an educational opportunity to make a series of useful and forgivable mistakes that create a ton of long-term value for their owners. Data Lakes are the critical early stage in the evolution of the data management ecosystem. They provide a laboratory for developing the business logic that glues business processes together, and as such they're key to all subsequent components of the data management ecosystem.

In this section we’ll lay out the context in which Data Lakes are built, and discuss some of the strategies used to build them. We’ll also walk through what we expect from a Data Lake, particularly the evolutionary function they play and the knowledge they create for us. Finally, we’ll talk about ways people end up wrecking their Data Lakes.

Data Lakes have a checkered history because creating a Data Lake is often an organization's first formal attempt to get serious about data management. Often sponsors think it's also the end-point. As a starting point, the Data Lake requires thinking beyond the chauvinistic boundaries of individual systems and the departmental workflows they support, to an understanding that there is in fact a flow of data through and across those processes. This is very positive. But it's tempting for an organization to treat the Data Lake as an end-state instead of an evolutionary stage or a niche in an ecosystem, and so they come to rely on it to the exclusion of future development. And what happens to a real, non-virtual lake that everyone loves? People build houses all the way around the edge, drain the marshes without realizing the role those marshes play in the lake's ecosystem, and dump all manner of garbage into the lake because it's temporarily convenient, not realizing they're degrading water quality. They fish the lake to within an inch of its life, and begin to demand artificial stocking. They may attempt to improve the lake with pavement or various concrete structures and waterslides. And over the course of all this history the lake itself often dies and becomes much less useful - or even downright harmful - to everyone involved. Similar problems afflict Data Lakes.

The typical Data Lake generally grows out of a demand by a management team to report synthetically on subject matter data that overlaps two (or more) separate systems with hard physical and logical boundaries between them. Management might ask, for example, "How long does it take for sales calls to result in orders?", where data on the progress of sales calls is stored in one system and orders in another. The key mechanical problem that needs to be resolved when answering such a question is merging a series of records about a sales call, with its status changes and formal dates, and a series of records about orders, with their own status changes and formal date fields. This connection is often pretty difficult for software people to create, but it's obviously critical to answering management's question in the long term. A far easier problem to solve, and also the first step in constructing the Data Lake, is simply to copy those two different sets of records, with their own table and field names and likely even incompatible granularity, into a single logical space. Once copied, someone can write exploratory queries to formulate the joins.

This copying problem is easy to solve because there are a wide variety of data movement tools, even if many solutions used in the wild prove limited in scale. If the data engineering team is a Microsoft shop they may use Microsoft’s bundled tools to create a simple pipeline to copy the tables from the source systems into a target database. If the organization is lucky enough to have standardized on one brand of database, “Virtual db” or “Linked Server” types of connections that make a direct link between the tables in the source databases and the copied tables in the target are an option. Some application code in .Net or Java or Python could be written to mimic the pump-and-dump operations of packaged data engineering tools like an Airflow DAG. Even R or Microsoft Access can be used to copy data. More recent replication tools such as Fivetran, Airbyte, or Stitch can do the same work with a simple interface, a lot of control and no data engineers. Data may be exported in batches each night using the resources of the database’s administration application, so it can then be slurped up using the same mechanism at the target physical database. Or the solution may simply replicate tables automagically in one form or another into the target environment.  
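To make the mechanics concrete, here is a minimal sketch of the bluntest version, assuming a SQL Server shop with a linked server named SRC_ERP and a lake database named LAKE; all of the object names are hypothetical.

-- Hypothetical nightly "pump and dump": wipe the Lake copy and reload it
-- wholesale from the source, via a linked server and four-part naming.
TRUNCATE TABLE LAKE.dbo.ERP_CUSTOMER;
INSERT INTO LAKE.dbo.ERP_CUSTOMER
SELECT *
FROM SRC_ERP.ERPDB.dbo.CUSTOMER;

Every tool named above is, more or less, doing some version of this operation, whether it's expressed in SQL, a drag-and-drop pipeline, or a vendor's replication engine.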

In every case the right plan is to copy all of the data in the source tables into the Data Lake, meaning all of the history, all of the records in each table. More sophisticated implementations should then attempt incremental daily or hourly extractions once the initial history copy is complete, so that individual copy operations are smaller. But there's often a fair amount of exploratory work to be done on the source data to determine a precise and reliable incremental mark, which may require the kind of debilitating queries that make production DBAs nervous, so it's usually easier to just copy all of the data in the source tables every time.
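Where the source does expose a trustworthy incremental mark, the pull can be narrowed with a watermark query. The sketch below assumes a LAST_UPDATE_DATE column and, for simplicity, insert-only source data; both assumptions need to be verified per source.

-- Hypothetical incremental extraction. Assumes LAST_UPDATE_DATE is reliable and
-- that source rows are never updated in place; otherwise this needs a MERGE/upsert.
DECLARE @watermark datetime2 =
    (SELECT ISNULL(MAX(LAST_UPDATE_DATE), '19000101')
     FROM LAKE.dbo.ERP_CUSTOMER);
INSERT INTO LAKE.dbo.ERP_CUSTOMER
SELECT *
FROM SRC_ERP.ERPDB.dbo.CUSTOMER
WHERE LAST_UPDATE_DATE > @watermark;

Proving that the mark really is reliable is exactly the exploratory, production-touching work described above, which is why the full reload remains the path of least resistance.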

This wholesale transfer of data will seem wasteful or inelegant to proficient and uncharitable software-oriented DBAs. It will also often be criticized by application owners, who will argue that the inefficiency of copying data wholesale from one location to another could be overcome if the organization just did all of its work in their application, instead of splitting functions. These criticisms arise in organizations conscious that they’re collecting data for eventual reporting, but who haven’t yet made the jump to a formal data management program. A lack of formal experience with the larger data management ecosystem means developers will take the easy way out where possible. The easy way out, unless the source tables contain many hundreds of millions of records, is (worst case) to just copy the entire thing, every time. At this stage in the evolution of an organization’s data management program, as mentioned above, the easy way is also the best way. 

The easy way out means copying should default to isomorphic transfers between source and target tables. "Isomorphic" means avoiding any changes to the data: transformations of the source data are impermissible in a Data Lake, and simple identity between sources and targets is what's wanted. The target tables in the Data Lake should mimic the source table structure, with the same field names, same grain, etc. Some conversion code might be required to deal with differences between database brands, such as incompatibilities between the datatypes of one brand and another, or between collations or code pages. Code pages can be particularly difficult to deal with: the transition from ASCII to Unicode and the peculiarities of the various data movement tools will often lead our Data Lake developer to change all datatypes to some form of character data, for example, or to leave out entirely any fields that can't be easily or obviously converted. But otherwise, the target copy should look identical to the source.
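In practice the isomorphic target is just a re-declaration of the source table, with conservative substitutions only where the brands disagree. The column names and types below are illustrative assumptions, not a prescription.

-- Hypothetical Lake copy of the CRM's CUSTOMER table: same columns, same grain.
-- Only types that don't convert cleanly across brands are widened to Unicode text.
CREATE TABLE LAKE.dbo.CRM_CUSTOMER (
    CUSTOMER_ID       int            NOT NULL,
    FIRST_NAME        nvarchar(100)  NULL,
    LAST_NAME         nvarchar(100)  NULL,
    EYE_COLOR         nvarchar(50)   NULL,  -- source-specific code type, stored as text
    LAST_UPDATE_DATE  datetime2      NULL
);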

Latencies may vary widely from source system to source system. Generally, it's the rare DBA who wants to make more work for themselves, and they will be loath to introduce organizational and cultural conflict; that usually means minimizing any potential impact on source systems. Here overly-cautious DBAs will make an easy alliance with the business owners of the source systems, who obviously don't want to see performance in their applications suffer for their users so Jim in Accounting can have the absolute latest data. As well, unless a reliable incremental mark can be found in the source data, each load will necessarily be a complete reload, which takes time at both ends of the pipeline. For some sources the latency might be very low, usually when some kind of virtual environmental connection (e.g. "linked servers" or replication) can be made between source and target. In the ideal case, the source system publishes a reliable incremental mark, such as a LAST_UPDATE_DATE field; these are often required on systems that need to be audited, and they make it possible to copy small subsets of source records to the target. Each of these considerations means that target tables will have the latest data possible, for some value of "possible."

We’ve twice alluded to the conflict between system owners and Data Lake boosters that can create an ongoing source of friction, so let’s just get the problem out in the open. System owners within an organization are partisans for their system. They want to see the use of their system expanded to more users, and as the local experts they can often point to functionality available in their system ostensibly better than that of competing systems within the organization. Remember our example use-case of the Data Lake: We have orders in one system, and sales calls in another, and we want to merge those two sets of records so we can do some analysis. Many system owners will simply argue that there’s no need for two systems to record these two processes, when everything could be done in their system. They also want to ensure the performance of the systems they own is more than adequate so their users stay satisfied, and that use by people not chipping in budget is bartered for political support. There’s nothing wrong with either of these considerations. Building a good application chock full of good data with loyal users that recognize its untapped potential is the goal of all software development, whether the product is data management systems or smartphone apps. 

But that healthy loyalty doesn’t preclude the need for the organization to hedge its bets, architecturally speaking. Differences in the workflows implemented by different software packages help us drive better workflows, software and plans, after all. The company that uses the same application for everything may appear to be super-resilient and disciplined. In fact they’ve restricted their range of options dramatically, and they ought to introduce some chaos to their application architecture right away. 

The primary argument for monolithic architectures is that maintaining multiple systems will lead to irreconcilable data about the same process. It is this concern, after all, that likely motivated Codd to propose the relational model. If CUSTOMER data is stored in only one place to prevent writes from multiple sources that might produce contradictory data, then why would a thoughtful organization run two separate systems creating CUSTOMER data? Two systems with CUSTOMER data in them would seem to create redundant, potentially conflicting and almost certainly corrupted versions of the CUSTOMER record.

But extending Codd's insight - that there should be one and only one storage location for a field - from individual fields to the subject matter made up of those fields is what your philosophy professor would have called a category error. We want to avoid the situation where, say, an ERP system writes datapoint X to its own CUSTOMER table and a CRM writes the same datapoint X to its own CUSTOMER table. In that situation, which is the source of Codd's insight, we could have the same CUSTOMER represented with EYE_COLOR=BLUE in ERP.CUSTOMER and EYE_COLOR=GREEN in CRM.CUSTOMER. In which case, from a reporting standpoint, we have different EYE_COLOR values and we don't know which one is true. A clever developer might suggest timestamping the writes and picking the latest value. This solution rides the fine line between clever and stupid, though, and it's just better to write EYE_COLOR to a single location, and copy it, if necessary, to the second location. But now we seem to be back to thinking one application is necessarily better than two.

But the case for two applications isn't a case for two identical applications; it's for two (or more) applications that, while they may share a subject matter (such as CUSTOMER), collect different fields. The two applications may overlap in their ability to write the same values into a subject matter, but the responsibility for the fields they collect about CUSTOMER is partitioned across the applications. In practical terms, our ERP writes certain fields in a CUSTOMER record stored in the ERP, and our CRM writes certain fields to a CUSTOMER record stored in the CRM, but they don't both write the same fields. Our ERP may be the source of ADDRESS data, and our CRM the source of the EYE_COLOR and RACE fields, and we may copy the one set of fields into the other system to support various workflows. But EYE_COLOR values are not created by both the CRM system and the ERP system, with a wistful hope for eventual consistency.

The astute reader may still wonder how, if the CUSTOMER record is now spread across more than one system, the pieces can be put together into a coherent whole. Given the different grains and keys used in various systems, how does the organization come up with a single, unified CUSTOMER record? This digression started out discussing reasons why system owners might be opposed to Data Lakes. It is now time to discuss how Data Lakes work, and in the process explain why they’re so useful.

Picture, then, a Data Lake as a series of tables copied from a diverse set of source systems. The target table names should reflect the source system's naming convention, and the conventions should be obvious with a little archeology. There's an ERP_ORDER table from the ERP system, a CRM_CUSTOMER table from the CRM and an ERP_CUSTOMER table from the ERP system, with different grains across each table depending on the source grain. There's an ERP_PRODUCT table from the ERP system, a PLM_PRODUCT table with product data from the PLM system, a SALESPERSON table from the CRM, and a number of other more-or-less useful tables. Diligent Data Lake developers will copy in more tables than they need, using their judgment to appear proactive when really they're just avoiding a bunch of rework. Less diligent system managers will have trouble remembering naming conventions or processes, and so their lakes will look more discontinuous.

[Figure: Data Lake Basics]

How does the developer choose which tables go into the Data Lake? 

A proactively constructed Data Lake intends to allow the reporting concepts used in each of the source systems to be grafted together in the Data Lake. In fact that's the point of the Data Lake: There's a report M in application A written using SQL script α, and a report N in application B using SQL script β, and the goal is to somehow merge the data in M with the data we get from N. To make this easier to visualize, suppose report M is a list of recent orders by CUSTOMER, and report N is a list of recent salesperson contacts with CUSTOMERs. A combined report would tell us which orders to attribute to which salesperson. To start, the data engineering team copies the tables referenced respectively by scripts α and β into our Data Lake. They've done this successfully when scripts α and β get the same results in the Data Lake as are obtained in the source systems, minus whatever latency effects may have been introduced. So it's the tables referenced in the joins in α and β that inform which tables to copy into the Data Lake. This process may seem very simple, but that's really all there is to the Data Lake.
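To make the table-selection rule concrete, imagine α and β look something like the following once re-pointed at the Lake copies; every table named in their FROM and JOIN clauses is a candidate for copying. All table and column names here are hypothetical.

-- Script α (report M): recent orders by customer, copied from the ERP.
SELECT c.CUSTOMER_ID, c.CUSTOMER_NAME, o.ORDER_NO, o.ORDER_DATE, o.ORDER_TOTAL
FROM LAKE.dbo.ERP_CUSTOMER c
JOIN LAKE.dbo.ERP_ORDER o ON o.CUSTOMER_ID = c.CUSTOMER_ID
WHERE o.ORDER_DATE >= DATEADD(day, -30, GETDATE());

-- Script β (report N): recent salesperson contacts, copied from the CRM.
SELECT s.SALESPERSON_NAME, c.FIRST_NAME, c.LAST_NAME, ct.CONTACT_DATE, ct.OUTCOME
FROM LAKE.dbo.CRM_CONTACT ct
JOIN LAKE.dbo.CRM_CUSTOMER c ON c.CUSTOMER_ID = ct.CUSTOMER_ID
JOIN LAKE.dbo.CRM_SALESPERSON s ON s.SALESPERSON_ID = ct.SALESPERSON_ID
WHERE ct.CONTACT_DATE >= DATEADD(day, -30, GETDATE());

The five tables referenced there are the minimum the data engineering team needs to copy in; a diligent developer will copy a few more besides.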

But the tricky part hasn't started yet. The tricky part isn't rocket science, but resolving it is pretty much the key to making the Data Lake useful as an evolutionary step. More than likely it's not clear how to connect the data produced by α and β. That is, there's a bunch of records output by α - maybe it's highly granular or transactional, the kind of thing the analysts in Finance love to get in the tens of thousands of rows because it's "raw" - and a bunch of records output by β, rolled up for the busy people in sales. While they're clearly conceptually connected it's not obvious how they're logically connected, or specifically what SQL might be written to connect them. There are of course some Data Lakes composed of systems where the corporate rollout proceeded with common sense, so there are for example shared keys or common naming conventions. These systems aren't unicorns; the author has worked with some of them. But by and large they're extremely rare, and for the most part, if data is transferred directly from source A to source B in real time there's some translation or transformation process that takes place between A and B. That is, the data copied from A to support processes in B has been transformed in ways convenient to B and the software developers who did the copying, and not to the longer-term needs of cross-system reporting. In order for the Data Lake to support a merged report and satisfy the request, the practitioner must determine how that transformation step in the pipeline between systems A and B works.

Sometimes this is really easy. In a nearly-ideal and not uncommon case, someone has documented the transformation logic in the pipeline that transfers order data to the sales system. That process may even happen in SQL, or in some procedural programming-language implementation that can be unpacked into SQL with a little ingenuity and experimentation. Or it may be that system A's CUSTOMER key is reflected in system B somewhere, perhaps in a field on a relevant record in B, or maybe as a concatenation of A's fields. So, for example, if in A the CUSTOMER_ID is "1001" then that key is stored in B in the field CUSTOMER_ORDER, and the transformation consists of A.CUSTOMER_ID concatenated with A.ORDER_NO, so that B's CUSTOMER_ORDER field stores the source data as "1001-ABC." In the application code that bridges the transfer of data between systems A and B, we would expect to find an explicit concatenation operation somewhere.

In that kind of case, when a report needs to be written that merges data from systems A and B in the Data Lake, it is easy to write some SQL to connect the CUSTOMER data copied from A with the CUSTOMER data copied from B by pulling apart the concatenated string. The SQL that becomes the foundation of the report is based on elements of α and β, and should, for the sake of sanity, just be constructed with a lot of cut-and-pasting from α and β, plus some additional preprocessing on one side of the join or the other to ensure that the CUSTOMER data on the left side of the join slots correctly with the CUSTOMER data on the right side.
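A hedged sketch of that join, using the hypothetical CUSTOMER_ORDER concatenation above, might look like the following; the table names are invented, and a production version would also guard against rows where the '-' delimiter is missing.

-- Pull A's key back out of B's concatenated CUSTOMER_ORDER (e.g. '1001-ABC')
-- and join on its components.
SELECT a.CUSTOMER_ID, a.ORDER_NO, b.SALESPERSON_ID, b.CONTACT_DATE
FROM LAKE.dbo.A_ORDER a
JOIN LAKE.dbo.B_SALES_ACTIVITY b
  ON CONVERT(varchar(20), a.CUSTOMER_ID)
       = LEFT(b.CUSTOMER_ORDER, CHARINDEX('-', b.CUSTOMER_ORDER) - 1)
 AND a.ORDER_NO
       = SUBSTRING(b.CUSTOMER_ORDER, CHARINDEX('-', b.CUSTOMER_ORDER) + 1, 4000);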

But things are generally not so simple, whether because the transformation logic between A and B isn't documented, or doesn't exist at all, or must be created by the practitioner from scratch. The worst case scenario, and also the most common, is joins where developers need to determine or even concoct natural keys for either side of the join. So it may be that we have A.CUSTOMER_ID, and in B no such thing. What we've got in B is some kind of national identifier like a social security number or social insurance number, plus birthdate and first/middle/last names, and these may or may not be scattered across multiple tables. Solving this harder case requires creating deterministic joins: a natural key in B that we can either duplicate exactly in A, or approximate closely enough.

In the former case, suppose (for example) A.CUSTOMER_ID in tables copied from A, and B.SSN + B.DOB + B.FIRST_NAME + B.LAST_NAME in those copied from B. We need to see if we've also got an A.SSN + A.DOB + A.FIRST_NAME + A.LAST_NAME, or the logically equivalent fields in A. Source A may refer to SSN as a PERSONA or NATIONAL_ID or some such thing, and DOB may be called BirthDate or DATEOFBIRTH or INCEPTION_DATE. (Those fields may also not yet be in the Data Lake, in which case they need to be added to the appropriate tables; this is generally why we copy everything at the beginning and avoid unnecessary parsimony.) When we create concatenated natural keys out of fields in A and B, we want to ensure we maintain the grain of those keys. It perhaps goes without saying that the keys need to be built from the same strings, but to make it explicit: if the candidate natural key to replace or at least simulate CUSTOMER_ID for A is:


B.SSN + B.DOB + B.FIRST_NAME + B.LAST_NAME


Then the candidate natural key in A should also be:


A.SSN + A.DOB + A.FIRST_NAME + A.LAST_NAME    


Which is to say: in order for the natural keys to match, the strings that make them up should match. Once we've established the natural key that determines A.CUSTOMER_ID in system A, we can use the corresponding components of that natural key in B as a proxy for B.CUSTOMER_ID, which doesn't exist.
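Before any of the defensive cleanup discussed next, the naive version of that join is simply the same components concatenated on both sides. All names here are hypothetical, and this version will silently miss matches until the strings are cleaned up.

-- Naive natural key join: the same components, concatenated the same way,
-- on both sides. Dates are rendered as YYYYMMDD (CONVERT style 112).
SELECT a.CUSTOMER_ID, b.*
FROM LAKE.dbo.A_CUSTOMER a
JOIN LAKE.dbo.B_CUSTOMER b
  ON a.SSN + '-' + CONVERT(varchar(8), a.DOB, 112) + '-' + a.FIRST_NAME + '-' + a.LAST_NAME
   = b.SSN + '-' + CONVERT(varchar(8), b.DOB, 112) + '-' + b.FIRST_NAME + '-' + b.LAST_NAME;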

In practice simple field-concatenation is almost never enough. Strings have various local configurations, and date datatypes are particularly local. This is certainly true when A and B are stored in different database brands (SQL Server and MongoDB, for example), or when one database is Unicode and the other is not. In such cases, which is to say virtually always, the developer constructing the key needs to write as much defensive SQL as they can to ensure routine and predictable matches. That means at least the following are good practices for these constructed natural keys:

  • Convert all fields to the same kind of string, e.g. NVARCHAR(50). Pick a string size large enough to accommodate any of the source fields, and then some. Make sure all source fields are explicitly converted to that datatype, e.g. CONVERT(nvarchar(50), OBVIOUS_INTEGER), or implicit conversions may fail at inopportune moments.

  • Remove any special characters, such as /, -, ., etc. In most SQL dialects the relevant function is REPLACE.

  • Don't concatenate without a delimiter: add e.g. '-' between each component as it's concatenated, so A.SSN + '-' + A.DOB + '-'...

  • Trim all leading and trailing spaces. 

  • Replace NULLs and zero-length strings (i.e. blanks) so they appear the same on both sides of the join. Every field should be wrapped with an ISNULL, NVL, DECODE or equivalent function so that empty values can't silently break the match.

  • Ensure dates are all in the same timezone, and modify the dates to a master timezone before converting to a VARCHAR if necessary.

Only once each side of the natural key join has been brought into compliance - and even then the above is the bare minimum - can the developer confidently analyze whether they've got something like a reliable match.
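Putting those rules together, one side of such a constructed key might look something like the sketch below, T-SQL flavored. The field names, the nvarchar(50) size, the uppercase normalization and the '19000101' placeholder for missing dates are all assumptions to be tuned per source.

-- One side of the constructed natural key: explicit conversion, special
-- characters stripped, trimming, NULL handling, and a '|' delimiter between
-- components. The same expression, with B's field names substituted in,
-- goes on the other side of the join.
SELECT
    UPPER(LTRIM(RTRIM(REPLACE(ISNULL(CONVERT(nvarchar(50), a.SSN), N''), '-', ''))))
  + N'|' + ISNULL(CONVERT(nvarchar(50), a.DOB, 112), N'19000101')
  + N'|' + UPPER(LTRIM(RTRIM(ISNULL(CONVERT(nvarchar(50), a.FIRST_NAME), N''))))
  + N'|' + UPPER(LTRIM(RTRIM(ISNULL(CONVERT(nvarchar(50), a.LAST_NAME), N''))))
    AS CUSTOMER_NATURAL_KEY
FROM LAKE.dbo.A_CUSTOMER a;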

(The reader may be tempted to skip some of the defensive programming detailed above, trying a "just enough to make it work" approach. You're obviously free to try that approach, but it's also a good way to find yourself deep in a rathole late one Tuesday night, trying to figure out why there are some random N CUSTOMERs in B for every A.CUSTOMER_ID. The goal of this defensive programming is to stabilize the string values on both sides of the join. In practical terms, you want to make sure the strings in both systems have had all their idiosyncrasies removed, so it's easy to find matches. If any possible variation could creep into the strings that make up the join, then it will creep into the strings that make up the join. Thus the temptation to be "agile" because YAGNI can quickly bog down development for days, with the added benefit of destroying a reporting team's credibility with its customers, simply because you didn't do a little diligence up front.)

Once a reasonably reliable natural key for both sides of the join has been developed, the practitioner can evaluate how the data might be merged. Some idea of the counts on both sides is critical to start with: a simple sense of how many total unique instances are on each side of the join, and how many instances of each type or category there are, helps inform whether the two sides have been successfully merged, and it should be gathered before going too far down any one development path. For example, it may be that A has all CUSTOMERs, and B only has CUSTOMERs that have called to complain about a particular product. In such a case the cardinality of A is going to be different than that of B. If an inner join using the natural key on A and B results in a row count that is significantly less than expected, knowing roughly how many matches should be expected can point the way forward. (The way forward may simply be confirmation that the natural key produces the correct counts.) Similarly, if counts show there are CUSTOMERs in B that aren't in A, and vice-versa, then the intersection of the two sets generated by the natural key join can be validated with people (e.g. Business Analysts or process management people) who ought to know the answer.
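A few counting queries of the kind that help here are sketched below, assuming the constructed keys from the earlier sketch have been materialized as hypothetical A_CUSTOMER_KEYED and B_CUSTOMER_KEYED views.

-- How many distinct keys exist on each side?
SELECT COUNT(DISTINCT CUSTOMER_NATURAL_KEY) AS a_keys FROM LAKE.dbo.A_CUSTOMER_KEYED;
SELECT COUNT(DISTINCT CUSTOMER_NATURAL_KEY) AS b_keys FROM LAKE.dbo.B_CUSTOMER_KEYED;

-- How many rows survive the inner join? Compare against what the business expects.
SELECT COUNT(*) AS matched_rows
FROM LAKE.dbo.A_CUSTOMER_KEYED a
JOIN LAKE.dbo.B_CUSTOMER_KEYED b
  ON a.CUSTOMER_NATURAL_KEY = b.CUSTOMER_NATURAL_KEY;

-- Keys on one side only, to take to the Business Analysts.
SELECT COUNT(*) AS in_a_not_b
FROM LAKE.dbo.A_CUSTOMER_KEYED a
WHERE NOT EXISTS (SELECT 1 FROM LAKE.dbo.B_CUSTOMER_KEYED b
                  WHERE b.CUSTOMER_NATURAL_KEY = a.CUSTOMER_NATURAL_KEY);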

The set that results from the merge is the foundation for the expected report, O. Often there will be some basis of comparison. For example, there may be a software developer somewhere who's looked at data from both sources and merged them procedurally, in some thoroughly inefficient application-code replication of a relational join, all in order to produce a report for a manager. Or the analysis used to determine candidate natural keys may already exist in a spreadsheet in the Finance department, used by someone's VLOOKUP function to connect the two sources. It's worth talking to people on various teams about these possibilities, first because there may already be a starting point that can save a lot of time in analysis, and second because if there are people who already have expectations about the counts it's best to know who those people are.

We mentioned earlier that this is the worst-case scenario, and also the most common, and it's worth a moment's review to understand why both are true. Data comes to rest in tables designed more-or-less to reflect the process automated by an application. Those tables are sometimes designed with a great deal of thought, paragons of good relational design. But usually they represent what a newbie object-oriented application developer thinks makes sense. Moreover, the persistence mechanisms for those processes, spread over an organization's applications, are very rarely designed by the same people from one application to the next, and even more rarely across the total set. Often those data models don't even originate within the same company. It is a major accomplishment for a dedicated application database development shop to maintain the consistency of its naming conventions and code formatting standards across all of its responsibilities, even when it develops everything in-house. It takes an enormous act of will to maintain the consistency of design patterns, cardinality assumptions and plain-old ontology across the organization. So what one database modeler thinks of as a "person" will not necessarily be the same as what another thinks is a "person." Even when the metamodel is identical from one implementation to another - as in, for example, the implementation of the versatile PARTY model across multiple ERP systems - the logic used to dedupe records inserted into the tables will differ from one implementation to the next.

Consider, as a side-by-side comparison, the national identifiers and their uses in the US and Canada. In the US virtually everyone, citizen or not, gets a Social Security Number or SSN (the "soe-sh," phonetically). You can't get paid in the US without an SSN, and a thriving business exists for manufacturing fake SSNs so people in the country illegally can get paid. As a result, for database developers in the US the SSN, an artificial key minted by the government, functions as a de facto natural key in virtually every database table used to identify individuals. And because SSNs are so prevalent and so consistently used as identifying keys, it's very easy to use the SSN to track individuals for e.g. credit risk, or for managing similar kinds of longitudinal behavior.

Canada has an analog to the SSN, called the Social Insurance Number or SIN. However the SIN isn't used the same way as its cousin in the US, and very few companies require that an individual be identified by their SIN. As a result Canadian identity checks concoct a natural key using various non-artificial (and often logically non-sufficient) attributes, such as name, address, phone numbers, places of employment, etc. What this means practically is that, for financial purposes at least (and often in lots of other areas), it's relatively easy to identify the longitudinal behavior of a person in the US with an SSN, while people in Canada who might have had the same SIN from birth but multiple home addresses will have trouble getting credit.

Practitioners in the US, then, will tend to want to use the SSN as the primary or default natural key for individuating PERSON records. Practitioners in Canada face more complicated scenarios and generally spend a lot more time cleaning strings and attempting probabilistic or contingent matches, or they may just give up entirely. In a US Data Lake there will often be a solid, well-managed nucleus of customer data and the financial data it supports, with SSNs sitting at the bottom of the definition of the PERSON, concentrated in several key tables with lots of matchable fields. In a Canadian Data Lake there may not even have been an attempt to define a PERSON independently of whatever transformation occurs as a person's data moves from source system A to source system B. The definition of a PERSON in a Canadian Data Lake may depend on as little as voluntary matches by individuals at the application level, with all the variability that entails.

But what happens even in the US when one source system contains an SSN, which allows application developers a reasonably sane strategy for mastering a person, and another source system doesn’t contain the SSN? This may occur when, for example, a practitioner is trying to match CRM data with an SSN against order data with first/middle/last/mailing address strings. In this case the US practitioner, spoiled perhaps by Mr. Roosevelt’s solid artificial key generation process, has to engage in the same deduping, string-cleaning and guesswork that the Canadian does.

And so this worst-case scenario, which is also the most common, occurs because in most cases an organization implementing a Data Lake is actively using the Lake to discover how to connect process data across various collection points. They’re evolving a semantic layer out of the application databases by creating natural keys to join together different processes. Application managers in IT defer this task because it’s too hard, or “just semantics,” or perpetually lower on the backlog than performance improvements or incremental feature enhancements. But sooner or later the management team will want to know how many customers go in at point X and come out at point Y, and that will require connecting the dots from CUSTOMER as represented in system A with how CUSTOMER is represented in B.

In practical terms, then, it's difficult to overestimate the long-term value of the Data Lake to the organization. A Data Lake is a laboratory for developing these connections between the data collected through an organization's applications, and for evolving those links into a semantic layer that matches the organization's real-world environment. Each of these constructed natural key joins, and the business logic encapsulated in filters (i.e. built into WHERE clauses) so that both sides of the join make sense, becomes a module that explains how these source systems relate. Those modules have utility outside of the Data Lake, not least in that they formalize or at least highlight assumptions built into the organization's applications. A well-constructed Data Lake will contain a multitude of tables from various sources, and a series of extremely valuable reports on top of the Lake that merge data from those sources. The real business value lies in the logic underlying those reports, because it reconstructs the relationships represented piecemeal by each application. Those SQL-encoded relationships are the gold nuggets of the Data Lake development process, quite aside from the profitability and workflow improvements that may be realized when data from disparate systems is collocated and merged.
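One natural way to keep those nuggets reusable is to formalize each constructed join, filters included, as a view in the Lake. The sketch below is one such module; the view name, columns and filter conditions are all hypothetical.

-- A reusable "module": the bridge between A's and B's notions of CUSTOMER,
-- with the business-logic filters built into the WHERE clause so that both
-- sides of the join make sense. Downstream reports select from this view
-- instead of re-deriving the join every time. (Run in the LAKE database.)
CREATE VIEW dbo.VW_CUSTOMER_BRIDGE AS
SELECT a.CUSTOMER_ID AS A_CUSTOMER_ID,
       b.CONTACT_ID  AS B_CONTACT_ID,
       a.CUSTOMER_NATURAL_KEY
FROM dbo.A_CUSTOMER_KEYED a
JOIN dbo.B_CUSTOMER_KEYED b
  ON a.CUSTOMER_NATURAL_KEY = b.CUSTOMER_NATURAL_KEY
WHERE a.IS_ACTIVE = 1            -- A's definition of a "real" customer
  AND b.CONTACT_TYPE <> 'SPAM';  -- B's records that are in scope for reporting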

Those relationships are the key to building the other components in a data management system. Once they’ve been developed, constructing an ODS or a couple of MDM systems will go significantly faster. In fact an organization that hasn’t got the more-or-less formal modules of SQL underlying reports that merge data from multiple systems will need to create those modules anyway in the course of building an ODS. There’s a case to be made, then, from an enterprise architecture perspective, that in the absence of any formal experience with big or sophisticated data management systems, a company is better off building a Data Lake first to (a) discover the connections between the source systems they need to merge, and (b) provide some immediate, definable gain to the business in the form of merged-system reporting, learning the lessons (and making the mistakes) that need to be made as they learn to do business intelligence. Data Lakes can then create powerful reusable constructs at an important, beginner-level stage in the construction of long-term infrastructure, contributing both short-term value and long-term value. 

If an organization has what it needs in a Data Lake, why not just stay with that type of system?

Aside from the fact that different pieces of the ecosystem have different functions not performed by a Data Lake, there are two temptations that tend to follow along with the construction of a Data Lake.

The first temptation is to force some kind of consistent local naming convention on the entities in the Data Lake itself. One scenario in the author's personal experience involved running a giant script, nightly, after the Lake had been populated, to change the original table and field names, and the references in the stored procedures, views and source-controlled scripts that pointed to those tables and fields, into properly spaced English words. Properly spaced English words require brackets for database purposes, of course, and as the script was already doing something more appropriate to word processing, all the names also needed to be fully spelled out: so SALES_ORDER became [Sales Order], LDGRJRNLTRNS became [Ledger Journal Transaction], SALES_ORDER_ID became [Sales Order Identifier], and so on.

There’s a laudable instinct underlying this temptation, which goes something like “make it easy for the business and new employees to read both SQL and schema.” But it’s both quixotic and irritating in its preciousness. The business will not read either SQL or schema, and probably shouldn’t besides. New practitioners who can’t get the hang of a database schema shouldn’t last past their probationary period. Data Lakes are an organic attempt to discover the connections between source systems, and the blocks of business logic and syntactic voodoo necessary to create a genuine semantic layer are best represented as they are, in both their natural precision and their unintentional vagueness, so that actual lessons may be learned. Those source systems will encompass wildly diverse naming conventions, cardinalities, business logic and goals. The Data Lake isn’t designed to fix those problems, just represent them and hopefully spur the upstream application developers to learn. 

Secondly, customers of the Data Lake will often end up asking the practitioner to organize source tables into a particular model that better suits their purposes. They may want a regular report out of the Lake pointed to a structure that contains only the latest orders in the SALES_ORDER domain, because storing all of the orders has led to an unbearable latency in their reporting. Alternatively they may demand the Data Lake store every state change every order goes through, so they can support analysis of drop-off in the order process. Or they may want a single, mastered CUSTOMER table instead of the Lake's ERP_CUSTOMER and CRM_CUSTOMER. All of these requests are common and will often drop into the practitioner's inbox on parallel timelines, seemingly out of nowhere. None is easy to do without a proper Data Lake, which intentionally reflects the source applications as exactly as possible. But all three use cases are better managed in separate logical systems.

The astute practitioner should take those requests and ask to fund an ODS or Analytical warehouse or MDM system, as the case may be. There will be a desire to simply modify the Data Lake to accommodate these new, logically incompatible requirements, to save time and money. But the Lake should not be paved over and turned into an entertainment center. The ODS needs its own logical space, as does an Analytical warehouse, and whether from a modeling, pipeline, reporting or physical indexing standpoint the Data Lake is simply not the best place for those concerns. The simplest counterargument to adapting the Data Lake into any of these systems is that by making these requests, the organization has already evidenced the need for a new, dedicated piece of the ecosystem. While prototypes can perhaps be tested in the Data Lake's logical space, what's really needed is a logical space dedicated to the proper solution so that it can be done properly. The practitioner should defend the integrity of the Data Lake, point to the new requests as evidence of a maturity the organization didn't previously have, and highlight the learnings provided by the Data Lake and how much those will help in the creation of the new systems. Properly nurtured, the Data Lake is a golden goose, and instead of killing it one should use what it produces to hatch more gold, of different kinds, into the ecosystem.

Data Lakes thus serve to document the organization's business and application logic, in addition to being really great reporting systems. Fixing problems in naming conventions is a job for another system, and the least of the problems a Data Management team has to worry about is making field names more like natural language. While it may be tempting to fix change management and inconsistent naming conventions in the Data Lake, doing those fixes in what is a laboratory environment makes the Lake harder to use over the long term while removing much of its utility. There's value in archeology, and burying the original under layers of ornamentation is unnecessary, wasteful, and unhelpful.

The Data Lake, then, is a good place to start for an organization and for a new data management practitioner. An organization that hasn't got much in the way of data management infrastructure, and that needs to start producing reports that cross system and process boundaries, is well advised to start copying tables from source systems into some reasonably large and robust empty database it can point report developers at. The business logic discovered, refined and productionized there will be useful in the more sophisticated follow-on systems that develop, in formalizing interactions between source systems that had likely been manual, and in getting everyone's feet wet with reporting, change management, and data management best practices.

