What are the risks of drowning if I build my own data lake?
Every utility analytics conference you attend these days seems to have one or two sessions discussing data lakes. What is a data lake anyway? Wikipedia defines it as “a method of storing enterprise data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files used for various tasks including reporting, visualization, analytics and machine learning.”
Wait a minute! I’ve heard this story before, and it didn’t end well. Isn’t a data lake the same thing as the data warehouses of twenty years ago? Back then the IT department told us, or more like promised us, that they could build systems to house all the enterprise data in one place. Analysts could access it anytime to make cool operational dashboards for all sorts of users, including the Executive Team, who would have the pulse of the organization on their desktops without impacting the production systems. Millions were spent, and I’ll leave it to the individual reader’s own experience to gauge whether the advertised value materialized.
So, what makes a data lake different? Well, for starters, we can now put a lake in a cloud. That’s certainly progress, right?
Joking aside, cloud environments have certainly lowered the cost of putting data all in one secure place—some would even argue that a cloud environment is more secure than the utility’s. Moreover, new virtualized high-powered environments, along with modern analytical and visualization tools like Python and R, make processing and reporting on all the data sources a whole lot easier and faster. But as with the data warehouse of old, some fundamental rules apply.
First. Data quality matters! Garbage data is garbage data. Using low-quality data in any analysis will yield misleading or erroneous results, whether you’re doing descriptive, prescriptive or predictive analytics.
Second. Content and context matter. There’s a big difference between kW and kWh, except when we are dealing with hourly data. Well, let me rethink that. That only holds true if the hourly kW value was derived from the kWh value. In some cases, and usually for transmission and generation data, the energy (MWh) reading will come from a revenue-grade meter, but the demand (MW) reading will come from a SCADA system and is typically a spot reading at the top of the hour. Can you see the problem? The two values describe the same hour but need not agree. This is even more problematic for voltage data. Is it the average voltage over a period, or something more meaningful, like the maximum or minimum over the same period, where readings are derived from sub-second sampling? And you must remember you can’t simply add kVA: apparent power only sums arithmetically when the loads share the same power factor, so kW and kvar have to be totaled separately.
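To make the point concrete, here is a minimal sketch in Python. The readings, function names and power factors are all made up for illustration, and the kVA part assumes every load is lagging. It compares the hourly average kW implied by the metered kWh with a top-of-the-hour spot reading, and then shows why two kVA values can’t just be added.

```python
import math

def average_kw(kwh, hours=1.0):
    """Average demand (kW) implied by metered energy (kWh) over an interval."""
    return kwh / hours

def combine_loads(loads):
    """Combine loads given as (kVA, power factor) pairs.

    Assumption: all loads are lagging, so their kvar values share a sign.
    Returns total (kW, kvar, kVA).
    """
    total_kw = sum(kva * pf for kva, pf in loads)
    total_kvar = sum(kva * math.sin(math.acos(pf)) for kva, pf in loads)
    return total_kw, total_kvar, math.hypot(total_kw, total_kvar)

# Hypothetical 5-minute demand readings (kW) across one hour.
five_min_kw = [400, 420, 450, 500, 650, 900, 950, 700, 520, 460, 430, 420]

# Energy over the hour: each reading covers 5/60 of an hour.
energy_kwh = sum(kw * (5 / 60) for kw in five_min_kw)

hourly_avg_kw = average_kw(energy_kwh)  # what a kWh-derived kW value reports
spot_kw = five_min_kw[0]                # what a top-of-hour spot reading reports

print(f"kWh-derived average: {hourly_avg_kw:.1f} kW, spot reading: {spot_kw} kW")

# Two hypothetical 100 kVA loads with different power factors.
loads = [(100.0, 1.0), (100.0, 0.6)]
kw, kvar, kva = combine_loads(loads)
naive_kva = sum(kva_i for kva_i, _ in loads)

print(f"Naive sum: {naive_kva:.1f} kVA, correct total: {kva:.1f} kVA")
```

With these invented numbers the spot reading understates the hour’s average demand, and the naive kVA sum overstates the combined apparent power. Neither value is wrong on its own; the trouble starts when they land in the lake without the context that says which is which.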
Third. There’s a big difference between a data lake where analysts are hammering away at the stored datasets to find neat, new, meaningful insights and an analytical production system built on the lake, where the analytical results are being used in critical business and operational functions.
Fourth. Although the involvement of the IT department, the “How Team,” is paramount to the successful formation and maintenance of the lake, they shouldn’t be taking the lead in defining the business value the new lake addresses. That’s the primary role of the various business areas, i.e., the “What Team.” So the “What Team” shouldn’t be telling the “How Team” how to build the lake, and the “How Team” shouldn’t be telling the “What Team” what is important from a business value perspective. Some of you know exactly what I mean! For those who don’t, we should talk.
Last. Even though the new modern data mining tools will bring new insights, the fact remains that some relational structure is still paramount. Grouping customers into new meaningful clusters based on unstructured social media data is not all that useful if you don’t know where they are located spatially when trying to optimize a demand response program.
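As a rough illustration of why that relational link matters, the sketch below joins cluster labels to a customer-to-feeder table. Every customer ID, label and feeder name is hypothetical, and a real system would do this in a database or a dataframe rather than with plain dictionaries.

```python
# Hypothetical cluster labels produced by some mining exercise.
clusters = {
    "C001": "evening-peaker",
    "C002": "ev-owner",
    "C003": "evening-peaker",
}

# Hypothetical structured record: which feeder serves each customer.
locations = {
    "C001": "FDR-12",
    "C003": "FDR-07",
}

def locate_clusters(clusters, locations):
    """Group clustered customers by feeder; list those with no known location."""
    located = {}
    unlocated = []
    for customer_id, label in clusters.items():
        feeder = locations.get(customer_id)
        if feeder is None:
            unlocated.append(customer_id)
        else:
            located.setdefault(feeder, []).append((customer_id, label))
    return located, unlocated

located, unlocated = locate_clusters(clusters, locations)
print("Usable for targeting:", located)
print("Clustered but not locatable:", unlocated)
```

Any customer that ends up in the second list has a perfectly good cluster label and is still of no use to a feeder-level demand response program.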
Basically, the rules of old still apply as they did in the data warehouse days when it comes to new data lakes and the cloud. You must have a business need defined first with a defined objective—the “what”—before you go down a path of spending time and lots of dollars building your new lake.
DNV GL’s team of analytics specialists and technical architects specializes in guiding our clients in deriving informed business and technical requirements for advanced analytical undertakings, for both grid and customer-based analytics. In addition, DNV GL, corporate-wide, is making a substantial investment in our own version of a data lake by providing a secure open platform, Veracity, for use by our consulting teams and clients. We will provide more details on Veracity in future blogs.