Mopping Up the Data Puddles

Data puddles are one of the hardest problems to solve in a data strategy. Connecting or combining these puddles to enable future analytics seems obvious, yet getting traction to move it forward feels like pushing water uphill. Why? To answer that, we need to look at why the puddles exist in the first place.

Data Puddles

The data puddle problem arises when data is split into small, isolated pools, often spread around the organisation with an analyst team beside each puddle. Each puddle has evolved to support its local team and local use cases. The trouble starts when we try to solve use cases that span many puddles. These projects get bogged down in the work of logically joining and physically importing data between the puddles, and then fail under the high cost of working across the data puddle landscape.

The existing landscape of data puddles has normally been a successful delivery for the organisation: local analysts have been able to deliver the analytics that run the business. It is usually the more advanced, integrated use cases that suffer. The problem is not just where the data is stored but also its structure and the knowledge management around the data itself. The challenge here is not the technical solution but the motivation and justification for change.

Data Catalogue

A data puddle is normally a small set of data centred around one or two data subjects. The dataset is small enough that one can quickly become an expert in the dataset. Knowledge can be shared verbally with the local team through a tribal knowledge approach. This is highly effective in bringing the small team together and getting work done.

From the enterprise perspective, working across the data puddles is more difficult. We are no longer close to the tribal team that holds the knowledge. Exceptions in naming conventions start to trip us up as we work with data we are not familiar with. What we miss is a global data catalogue or data dictionary: a definition of the data items, their usage patterns, and the data quality exceptions that need to be considered.
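A catalogue entry does not need heavyweight tooling to be useful. The sketch below shows one possible shape for an entry covering the three things mentioned above (definition, usage patterns, quality exceptions). All the names here ("customer_id", the "finance" puddle, the legacy format note) are invented for illustration, not drawn from any real system.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """One item in a minimal global data catalogue (illustrative only)."""
    name: str                 # canonical name of the data item
    definition: str           # business meaning, agreed with the local team
    source_puddle: str        # which local dataset owns this item
    usage_notes: list = field(default_factory=list)
    quality_exceptions: list = field(default_factory=list)

# Hypothetical entry, populated with the local puddle team's knowledge.
entry = CatalogueEntry(
    name="customer_id",
    definition="Unique customer reference issued at account creation",
    source_puddle="finance",
    usage_notes=["Join key to the CRM puddle's 'cust_ref' column"],
    quality_exceptions=["Pre-2015 records use a legacy 6-digit format"],
)

print(entry.name, "->", entry.definition)
```

Even a plain structure like this captures the naming exceptions that would otherwise live only in the heads of the local team.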

Building out the data catalogue depends entirely on the local puddle teams' knowledge. Building it without their significant help is ineffective, if not impossible. Even this small step towards resolving the puddle problem will have a significant impact on the local teams' workload. If we follow through on resolving some of the naming exceptions, the work grows further, as their existing analytics suites will need to be updated.

Joining the Puddles

To enable the larger use cases, the data needs to be easily joinable across the data puddles. That implies the data is stored in the same storage technology, or at most a few technologies, and that it uses similar data structures and compatible keys for joining. This is beginning to sound like a migration to a data warehouse or data lake; certainly, that is one option for resolving the problem.

The data can be left in distributed data puddles, provided consistency in the technology and structure of the data is enforced: enough to relieve the challenge of joining across the puddles. Though this may never serve the most demanding real-time use cases, it would be enough to enable most. It is also a stepping stone to merging the puddles, should that be needed as a future state. Once they share similar technology, data structures, and keying strategies, merging becomes much easier.
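What "compatible keys" buys us can be shown in a few lines. In this sketch, two hypothetical puddles (a finance extract and a CRM extract, both invented for illustration) share a "customer_id" key, so joining them is trivial; without that agreed key, each cross-puddle project would first have to build its own mapping.

```python
# Two hypothetical puddles that already agree on the "customer_id" key.
finance = [
    {"customer_id": 1, "balance": 120.0},
    {"customer_id": 2, "balance": 75.5},
]
crm = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 3, "segment": "corporate"},
]

# Index one puddle by the shared key, then join the other against it.
crm_by_id = {row["customer_id"]: row for row in crm}
joined = [
    {**f, **crm_by_id[f["customer_id"]]}
    for f in finance
    if f["customer_id"] in crm_by_id
]

print(joined)  # only customer_id 1 appears in both puddles
```

The join logic is one line once the keying strategy is consistent; the hard, human work is getting every puddle to agree on that key in the first place.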

Once again, the resolution strategy will have a considerable impact on the local analytics teams. There may be new technology to learn, new data structures to master, and changes to implement in their existing analytics suites. All this with no benefit to the analytics they are already providing to the business.

Selling the Puddle Mop Up

The data strategy sells the data future to the C-suite. From their enterprise-wide viewpoint, the value of connected data use cases may be an easy sell, and it becomes a high-priority imperative for the company. Surely the local data puddle teams will get on board and help us push forward? In reality, I would expect significant pushback. There is not enough value in merely helping the wider enterprise to truly motivate the teams.

What we need are the killer use cases that bring these teams along on the future journey. These may be use cases they already know about and have been unable to deliver, or new opportunities their stakeholders are thinking about. How do these overlay on the data puddle landscape? Prioritising these use cases may deliver the motivation needed to resolve it.

Summary

From some perspectives, the data puddles are not a problem at all: the local data teams around them are a data success. The challenge is local justification and motivation for change. That change may be significant in terms of technology, data structure, and maintaining the data catalogue. A top-down imperative is sometimes attempted, but it rarely works. We need engagement and enthusiasm across all of the puddles to build the lake of the future and deliver on the future data use cases.

To learn more about Data Strategy, take a look at my course on Udemy “Getting Started on the Data Strategy”.

If you are interested in Data Strategy, IT Strategy, and IT Architecture, please do follow me on social media. Or get in touch.

LinkedIn: JonDurrant
GitHub: JonDurrant
Instagram: @durrantjon
Twitter: @drjonea
WordPress: DrJonEA
YouTube: @Jon Durant