The data lake – it’s a phrase that’s thrown around a lot right now, but is it just an empty buzzword, or does it actually bring real value?
Well, there are certainly some misconceptions around the concept of data lakes. The biggest of these is “we can keep everything because storage is cheap”, which is the overarching idea behind data lakes.
However, if you don’t manage your data lake or have the right governance, skills and processes to get value out of it, it can do more harm than good. So how can you avoid turning your data lake into a hindrance?
Let’s take a look at some points to be aware of when managing your data lake.
A note on the difference between data and information
In the following paragraphs I’ll be referring to data and information. There is a seemingly subtle, but extremely important difference between the two. To save you some internet searching, data is an information carrier. Usually data carries more than one piece of information. Keep in mind, people are ultimately interested in the information data carries, not data itself. Imagine a sensor emitting its current state every second. Potentially there are two pieces of information attached to this datum – its state, obviously, (which gains significance only when it changes) and the sheer existence of the datum which suggests the sensor is working.
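To make the sensor example concrete, here is a minimal Python sketch that pulls both pieces of information out of the same stream of data. The `Reading` class, the state names and the one-second interval are assumptions made up for illustration, not part of any real sensor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One datum emitted by a hypothetical sensor."""
    timestamp: int  # seconds since an arbitrary start point
    state: str


def state_changes(readings):
    """Information 1: the moments when the state actually changed."""
    changes = []
    previous = None
    for r in readings:
        if r.state != previous:
            changes.append((r.timestamp, r.state))
            previous = r.state
    return changes


def silent_gaps(readings, expected_interval=1):
    """Information 2: stretches with no datum at all, hinting the sensor was down."""
    gaps = []
    for earlier, later in zip(readings, readings[1:]):
        if later.timestamp - earlier.timestamp > expected_interval:
            gaps.append((earlier.timestamp, later.timestamp))
    return gaps


readings = [
    Reading(0, "closed"),
    Reading(1, "closed"),
    Reading(2, "open"),
    Reading(3, "open"),
    Reading(7, "open"),  # nothing arrived for seconds 4 to 6
    Reading(8, "closed"),
]

print(state_changes(readings))  # [(0, 'closed'), (2, 'open'), (8, 'closed')]
print(silent_gaps(readings))    # [(3, 7)]
```

The six data points carry only three state changes, while the missing readings between seconds 3 and 7 tell a separate story about the sensor itself.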
The similarities between a data lake and an actual lake
If you think of a data lake as an actual lake, it'll help you understand some of the ideas around it.
Let’s lay out the landscape first. We have a nice pure lake (our data lake), with several creeks and streams (our data sources) feeding it with water (data). At this very moment, our lake is filled with pure mountain water, so clear that we can see the bottom. People (data scientists) can observe its banks and tributaries (metadata and data sources) or dive directly into it (perform analysis).
This parallel continues further. As with this beautiful mountain lake, only good swimmers and experienced divers (analysts and data professionals) should be allowed in. Their safety, and the preservation of the lake’s ecosystem, water quality and long-term sustainability are all very real concerns.
The same goes for a data lake. As beautiful and tempting as it is to dive into all that data, you need to understand how to interpret what you’re swimming in and refrain from polluting the data lake in any way to keep it sustainable. On top of that, there are regions of the lake where only authorized personnel are allowed access, due to further safety (security access) and privacy (anonymization, data encryption, etc.) regulations.
Take GDPR, a stringent set of regulations affecting any organization doing business in the EU.
With sky-high sanctions, GDPR imposes strict requirements for both governance of personal data and communicating transparency around storage and processes that control the data. Pouring such data into a lake means it can’t be a free-for-all, and there’s a need to restrict swimmers (users) from accessing whatever they want at any time. There may be restrictions around what data can be stored, in what format, and in what combinations. Ultimately, data owners need to be accountable for what they’re putting into the data lake, who is swimming in it, and for what purpose.
Where did your water come from?
Data lineage describes the source of your data, where it’s been and what’s happened to it on the way. Although this could be a topic for a separate article, it's absolutely important to know your creek’s origin and how exactly it got to the lake.
Does it flow past a chemical plant (has it been polluted on its journey)? How much mud (data carrying no or insignificant information) your creek brings into the lake is also a valuable indicator. This will determine how much and what kind of maintenance your data lake might need and whether the creek is even trustworthy. (You’d probably drink from a mountain spring but not so much from water at a freight port.) If you allow too much garbage into your data lake, people will eventually stop using it. Even worse, they may learn to ignore the muddy waters but fail to notice the seemingly crystal-clear creeks that are polluted with invisible toxic chemicals.
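As a rough illustration of what "knowing where your water came from" can look like, the Python sketch below keeps a small lineage record next to a dataset. The `LineageRecord` class and the dataset, system and step names are all hypothetical; real lineage tooling is far richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Where a dataset came from and what happened to it on the way."""
    dataset: str
    source: str
    steps: list = field(default_factory=list)

    def add_step(self, system, action):
        # Append one hop of the journey and return self so calls can be chained.
        self.steps.append((system, action))
        return self

    def describe(self):
        hops = [self.source] + [f"{system} ({action})" for system, action in self.steps]
        return f"{self.dataset}: " + " -> ".join(hops)


record = (
    LineageRecord(dataset="orders_2019", source="shop_db")
    .add_step("nightly_export", "dumped to CSV")
    .add_step("loader", "dropped rows without order id")
)

print(record.describe())
# orders_2019: shop_db -> nightly_export (dumped to CSV) -> loader (dropped rows without order id)
```

Even a record this simple answers the two questions above: which creek the data came from, and which "chemical plants" it passed on the way.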
Monitoring data quality
This leads on to another important point — monitoring the data flowing into the lake. This obviously helps as an early warning to protect the quality of your lake, but there’s a broader and deeper benefit.
Receiving corrupted or incomplete data could suggest something wrong with an upstream system. In a sense, this central hub should act as the health monitoring facility or internal standards authority that affects all connected systems.
We’ve seen instances where analysis of data loading process logs has helped to identify low-performing branches of a large organization, just by revealing the above average occurrence of errors in the data they were providing for the central hub. This “meta information” can be equally important and as transformative as the data itself.
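The kind of check described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the load log is a list of `(source, succeeded)` pairs, the branch names are invented, and "above average" means a higher error rate than the mean across all sources.

```python
from collections import Counter
from statistics import mean


def error_rates(load_log):
    """Return the share of failed loads per source."""
    totals = Counter()
    failures = Counter()
    for source, ok in load_log:
        totals[source] += 1
        if not ok:
            failures[source] += 1
    return {source: failures[source] / totals[source] for source in totals}


def above_average(rates):
    """Sources whose error rate is higher than the average across all sources."""
    average = mean(rates.values())
    return sorted(source for source, rate in rates.items() if rate > average)


# Hypothetical load log: (source, did the load succeed?)
load_log = [
    ("branch_a", True), ("branch_a", True), ("branch_a", True), ("branch_a", False),
    ("branch_b", True), ("branch_b", False), ("branch_b", False), ("branch_b", False),
    ("branch_c", True), ("branch_c", True), ("branch_c", True), ("branch_c", True),
]

rates = error_rates(load_log)
print(rates)                 # {'branch_a': 0.25, 'branch_b': 0.75, 'branch_c': 0.0}
print(above_average(rates))  # ['branch_b']
```

In practice you would probably want a more robust threshold than a plain average, but the principle is the same: the loading logs are themselves a source of information.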
In the worst case, you could end up with a data swamp (or shall we call it a “data dump”?). What is that?
Well, if your data lake is unmaintained, with no trace of where its contents have come from and with unreliable or absent control mechanisms, your lake will suffer pollution and you will be unable to navigate it. There’s a much higher chance of someone getting hurt or even drowning if the banks of your lake are uncultivated and left to the wilderness of nature. Thinking of a data lake as just a place to put data, without plans and processes for proper treatment, might quickly waste your investment.
What’s in your data lake?
Assuming you’ve monitored the quality of your incoming water (or data), and you’ve fixed any problems, how do you go about retrieving the parts you want out again?
The big advantage of a data lake (as opposed to a data warehouse, for example) is that you don’t have to spend too much time upfront organizing and structuring your data. However, you do need to have something to organize it with, otherwise it’ll become a mess.
Even simple tagging of the data’s arrival day, time and source can be an enormous help. But, imagine what could happen if there was also more general information about the data’s content on top of this. The benefit of a data lake – being able to sift through what you have and find what you need – becomes much more achievable when some basic measures are put in place to track what you have in your lake.
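Here is a minimal Python sketch of such tagging, assuming an in-memory list as the catalogue. The `CatalogEntry` fields, the file paths and the tag names are hypothetical; a real data catalogue would live in a dedicated tool or database.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass(frozen=True)
class CatalogEntry:
    path: str
    source: str
    arrived_at: datetime
    tags: frozenset = frozenset()


def register(catalog, path, source, tags=(), arrived_at=None):
    """Record a newly landed file; arrival time defaults to now (UTC)."""
    entry = CatalogEntry(
        path=path,
        source=source,
        arrived_at=arrived_at or datetime.now(timezone.utc),
        tags=frozenset(tags),
    )
    catalog.append(entry)
    return entry


def find(catalog, source=None, tag=None, day=None):
    """Return the paths that match every filter that was supplied."""
    return [
        e.path for e in catalog
        if (source is None or e.source == source)
        and (tag is None or tag in e.tags)
        and (day is None or e.arrived_at.date() == day)
    ]


catalog = []
register(catalog, "raw/crm/2019-03-01.json", "crm", {"customers", "personal-data"},
         datetime(2019, 3, 1, 2, 15, tzinfo=timezone.utc))
register(catalog, "raw/web/2019-03-01.log", "web", {"clickstream"},
         datetime(2019, 3, 1, 4, 0, tzinfo=timezone.utc))
register(catalog, "raw/crm/2019-03-02.json", "crm", {"customers", "personal-data"},
         datetime(2019, 3, 2, 2, 15, tzinfo=timezone.utc))

print(find(catalog, source="crm"))
# ['raw/crm/2019-03-01.json', 'raw/crm/2019-03-02.json']
print(find(catalog, tag="clickstream"))
# ['raw/web/2019-03-01.log']
print(find(catalog, day=date(2019, 3, 1)))
# ['raw/crm/2019-03-01.json', 'raw/web/2019-03-01.log']
```

The arrival time and source come for free at load time. Content tags such as `personal-data` take a little more effort, but they are what lets you answer questions like "where do we hold data covered by GDPR?".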
Lifeguards on duty
Since a data lake contains just about all the data you have, it is also a great potential risk, and requires carefully defined rules to manage that risk.
Remember when we talked about GDPR? The level of data governance you need can vary depending on your circumstances. Going back to our water lake, sometimes we’d be fine just with a lifeguard ensuring the swimmers and visitors aren't doing anything they shouldn't be. Other times, a perimeter fence and guard dogs are required. The point is to fulfil legal and regulatory restrictions as well as internal guidelines.
What's more, security also relates to making sure people are considerate and respectful to each other. The lifeguard’s duty is to make sure one person does not bully others and hog the whole lake for themselves. Sometimes, the provisioning of a smaller pool with samples of water can be a way to deal with complex hypotheses requiring multiple iterations without affecting the work of other people.
Are data lakes for everybody?
In a word – no. The ‘data lake’ is a great buzzword, but as we’ve seen, it’s not the magic bullet to solve all your data problems. You need to put the right processes, tools and governance in place to make the data lake work for you. And, crucially, you need to have the right skills to be an effective user of the data lake.
We're going to leave the lake analogy for a minute and switch to a cooking analogy (apologies for this, you’ve probably already guessed this is a very analogy-heavy blog post). You might have all the ingredients and a book of recipes, but if you lack the skill or drive to be a good cook, you probably won't make a great meal.
Hoarding data is like stockpiling ingredients in your pantry. You’re not going to get a single meal if you don’t step up and, well, start cooking.
Can the ‘data lake’ be a useful concept?
Yes, the data lake may be a powerful concept. But only if the right effort is put in to get the results out. To continue our cooking analogy, if you really want to cook, and especially if you want to cook exotic, never-tried-before meals, a data lake is the way to go. It’s a giant pantry that allows you to store all sorts of ingredients to use in your experimental cooking. But if you’re NOT into cooking, hoarding ingredients won’t fill your bellies with delicious meals.
Articles that oversimplify the situation and use phrases like “keep data and figure out later” or “self-service and on-demand access” just deepen the problem; they are likeable but unrealistic. I became quite a fan of the article “Are Data Lakes Fake News” by Uli Bethke, which challenges these and some other claims.
Still thinking about employing a data lake? Good. Just remember to keep track of your data, mind your data governance, catalogue data from day one, and keep people without enough skills away.
Above all, remember that a data lake is not a magical place where all data problems are going to be solved in the blink of an eye.
Editor's note: This blog post was updated in 2019 to streamline the content and offer more value. Enjoy!