Tuesday, June 5, 2018

Welcome to the beginning of the film


 By Illustrator unknown [Public domain], via Wikimedia Commons

In the twenty-plus years I've worked in data management, I've spent a lot of time having some version of the following conversation:

Me: Can you send me a copy of your data model? 
Co-worker: Well we don't really have a data model.  I mean, there was some design put into place a long time ago but I think that guy is long gone. 
Me: OK do you have anything like a schema diagram? 
Co-worker: Well not really, no.  Do you know how to print out a data dictionary?  You can talk to [long-tenured employee, let's call him Doug] and he'd be able to explain why things look the way they do.  I mean, the data is pretty bad. 
Me: OK where does Doug sit? 
Co-worker: Well Doug isn't really in this department anymore...
Me: Alright so then can you send me a list of reports? 
Co-worker: Oh that's even worse than the data lake... 
And so on.  Everyone has bad data, and it's disorganized.  They're not sure why their data management systems look the way they do, and the guy who does have some idea is really apologetic.  He also doesn't want to own it, because who wants to own a complete mess they didn't create?  The systems that move data from one place to another are a variety of procs, scripts and toolsets, and even where some of the code is in a source control system, it's arranged in any number of ways.  There have been more than a few developers, each with a creative approach to ETL or pipeline development or ingestion or loading or parsing or whatever term the local vernacular has settled on to describe moving data from one place to the next.  Each style leaves its own mark on the overall system.

The organization reports on that data using a number of different tools, each of which points at a number of different sources, sometimes overlapping and sometimes with obvious redundancy.  Some of those reports are critical to running the business and some of them are suspect, and even if people do know which are which, they don't know anything about the underlying layers.  There may be visualization people in the organization, particularly if it's a more modern internet company, but nobody on the BI team really thinks much about the usability of their reports.

There may be data scientists or analysts - again depending on the local vernacular - and they may or may not be familiar with statistics, SQL, relational modeling and the basic principles of good data management.  Some of them may be R experts, or really good at Python, but they're often subject-matter experts in the organization's line-of-business and just obsessed enough to know how things should work.  More often than not they're overscheduled, often on projects with only a dim connection to the health of the organization.  Almost all of them take a lot of pride in their work, but their biggest complaint is that once they start asking provenance questions about their data, they have very little to work with.  In other words, if they started paying attention to uneven quality, auditability and consistency, they would usually end up with no data to science.  I suppose there are data scientists and analysts who go to work each day to plentiful data that's ready to use, but I haven't met any.

None of the organizations I've worked in have lacked data, but they've all complained about its quality.  They all know they need to use it, and that if we could make it easy to use, their work lives would be measurably improved and their organization would be better off.  I know that seems almost hyperbolic and embarrassingly naive and enthusiastic, but it's true.

And the complaints have generally been of the "parts of an elephant" variety.  Which is to say, they identify the major data quality problem as the one they're working on, arguing for priority with people who identify the problem as the one they're working on.  No one realizes that there's an elephant in the room.

I've started this blog to work through how to discover the shape of the elephant, and how to keep it healthy and happy.  I've got a lot of opinions.  I plan to talk through those opinions and hopefully irritate some of my friends enough that they'll be provoked to provide their opinions as well.       
