Conquering the data wilderness: practical considerations and solutions
At the 2016 American Council for an Energy-Efficient Economy’s Summer Study in Asilomar, I was part of an informal afternoon group session for… let’s call us “data enthusiasts” … around the challenges teams face trying to leverage data to improve customer experiences and utility insights. One of the utility participants stood up and made a statement that really stuck with me to describe what they were facing – both within and external to their own data systems: “trying to make connections in our data jungle is next to impossible sometimes!”
Nearly three years later, I am really excited to share an update, both through this blog and as part of a panel at the 2019 International Energy Program Evaluation Conference (August 19th to 22nd in Denver, CO), on some recent experiences in the wild from the “data jungle” of Massachusetts and how the MA Program Administrators and their stakeholders have found new and creative ways to integrate and leverage data sets.
The first, and maybe the most important, takeaway is that the “data jungle” does not have to be daunting; it also does not have to be complex, expensive, or out of reach for your team. In fact, you probably already have much of the data you need to make great inroads. In Massachusetts, the core element we leveraged to integrate data and enhance analytical insights and customer experiences is geography. Geography is a fantastic unifier because the data we collect can almost always be tied back to a geographic space at some level. Utility billing data has an account number that ties back to a customer database with information on where the customer lives. A totally different database capturing environmental compliance and emissions information may share nothing with the utility data other than the fact that those emissions are occurring at a geographic place on the globe. With that insight we can co-locate the two datasets, and the geography becomes the basis of a shared key!
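To make the idea of geography as a shared key concrete, here is a minimal sketch in Python. It assumes both datasets already carry latitude and longitude, and it snaps each coordinate to a coarse grid cell so that records from unrelated sources land on the same key. The field names (`lat`, `lon`) and the function names are hypothetical, chosen only for illustration; this is not the specific method used in Massachusetts.

```python
from collections import defaultdict


def geo_key(lat, lon, precision=3):
    """Snap a coordinate to a grid cell (3 decimals is roughly 100 m of latitude)."""
    return (round(lat, precision), round(lon, precision))


def join_on_geography(left, right, precision=3):
    """Pair every record in `left` with each record in `right` sharing its grid cell.

    Both inputs are lists of dicts with hypothetical 'lat' and 'lon' fields.
    """
    # Index the right-hand dataset by grid cell so each lookup is a dict access.
    index = defaultdict(list)
    for rec in right:
        index[geo_key(rec["lat"], rec["lon"], precision)].append(rec)

    pairs = []
    for rec in left:
        for match in index.get(geo_key(rec["lat"], rec["lon"], precision), []):
            pairs.append((rec, match))
    return pairs
```

One caveat with this simple version: two points that sit close together but on opposite sides of a grid boundary will get different keys, so a real workflow would likely also check neighbouring cells or use a distance test.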
Even non-point data (i.e. data that do not have a specific address), such as weather patterns or American Community Survey neighbourhood data, can be assigned to individuals based on the location data in another file, or those individual points can be aggregated up to the level of the neighbourhood for analysis. Neighbourhood data is powerful in its own right even when we cannot assign a value to a specific location. For example, combining billing data patterns with the demographic characteristics of different neighbourhoods can help you better understand how and where electric vehicle growth might occur. This in turn allows you to start proactively assessing policy, infrastructure, and programmatic design questions so that when programs are deployed the customer experience is a positive one!
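Moving between individual points and neighbourhoods can be sketched as follows. This is a simplified illustration that assumes each neighbourhood is available as a plain list of (lon, lat) polygon vertices and that each customer record has a location; real boundary files would normally be handled with a GIS library. All field and function names here are hypothetical.

```python
from statistics import mean


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon's vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighbourhood(points, neighbourhoods):
    """Tag each point record with the id of the first polygon that contains it."""
    tagged = []
    for p in points:
        hood_id = next(
            (h["id"] for h in neighbourhoods
             if point_in_polygon(p["lon"], p["lat"], h["polygon"])),
            None,  # the point falls outside every known neighbourhood
        )
        tagged.append({**p, "neighbourhood": hood_id})
    return tagged


def average_by_neighbourhood(tagged, field):
    """Aggregate a numeric field up to the neighbourhood level."""
    groups = {}
    for rec in tagged:
        if rec["neighbourhood"] is not None:
            groups.setdefault(rec["neighbourhood"], []).append(rec[field])
    return {hood: mean(values) for hood, values in groups.items()}
```

Once each point carries a neighbourhood id, neighbourhood-level attributes (survey demographics, for instance) can be attached to individuals, and individual values such as usage can be rolled up in the other direction.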
“But wait,” you say, “what do we do when the geographic relationship is not exact and there can be many different customers and data within a single address?” A classic example is a high-rise mixed-use building. Compounding this, not all datasets necessarily carry the same geographic data: perhaps the utility data has the address on the side street where the meter is located, while another source like the tax data has the public entrance on a totally different street. Very true; geography is still our friend here, but this particular part of the data jungle requires some additional tools. In MA, we have had really good success layering phonetic logic models on top of the geographic data. These models look for patterns and similarities in text strings to identify and score possible matches. They do not require that text fields, which are notoriously unstandardized, match perfectly between different sources, because experience and human nature have shown that this will rarely be the case! The downside of tools like these is that they are resource and process intensive to run on large datasets, and they are prone to identifying false positive matches in the absence of increasingly complex or restrictive logic, which adds more time and cost. By leveraging the geographic relationships in the set, though, we can split one large dataset into millions of smaller subsets. Specifically, in MA we set up a search radius around the target address and indicated that anyone within the specified distance should be considered part of the potential match set. Each subset can then be iteratively looped through and matched only with candidates that both sound similar in text structure and are in the same area. This type of logic allows you to automatically match and remove 1-1 records, then cycle through the data to remove good matches within the subsets as well, further increasing the likelihood that the lower-quality matches that remain are actually good candidates for your target.
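The combination of a search radius with phonetic scoring can be sketched roughly like this. The sketch makes several assumptions: it uses the haversine formula for distance, classic Soundex codes as a stand-in for whatever phonetic model a team actually uses, and a simple overlap ratio as the score. The field names, the 100 m radius, and the 0.6 threshold are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

# Soundex digit for each consonant; vowels, H, W and Y have no code.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {c: digit for digit, letters in _GROUPS.items() for c in letters}


def soundex(word):
    """Classic four-character Soundex code, or '' if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            encoded += code
        if c not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]


def phonetic_score(a, b):
    """Share of Soundex codes the two strings have in common (0.0 to 1.0)."""
    codes_a = {soundex(t) for t in a.split()} - {""}
    codes_b = {soundex(t) for t in b.split()} - {""}
    if not codes_a or not codes_b:
        return 0.0
    return len(codes_a & codes_b) / len(codes_a | codes_b)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def match_within_radius(targets, candidates, radius_m=100, min_score=0.6):
    """For each target, pick the best-sounding unused candidate inside the radius."""
    matches, used = {}, set()
    for t in targets:
        # Geographic blocking: only nearby, not-yet-matched records are scored.
        nearby = [
            c for c in candidates
            if c["id"] not in used
            and haversine_m(t["lat"], t["lon"], c["lat"], c["lon"]) <= radius_m
        ]
        scored = [(phonetic_score(t["name"], c["name"]), c) for c in nearby]
        scored = [s for s in scored if s[0] >= min_score]
        if scored:
            best_score, best = max(scored, key=lambda s: s[0])
            matches[t["id"]] = (best["id"], best_score)
            used.add(best["id"])  # remove the matched record from later rounds
    return matches
```

Note that this version scans every candidate for every target, which would be too slow on millions of records; at scale the nearby set would typically come from a spatial index or a grid lookup rather than a full scan.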
The best part of leveraging geographic relationships, augmented by phonetic logic models, to conquer the data jungle is that even the simplest solutions can be incrementally built upon to help your team return immediate value without the need for fundamental shifts in your business processes. This type of incremental approach allows data users to get their hands dirty with analytics, fast, and also demonstrates the immediate value of investments to stakeholders and regulators to rally their support and buy-in. All of this comes without overwhelming a team’s day-to-day activities with complexity, advanced analytic tools, software, and other items that can distract from the utility’s day-to-day business objective of delivering safe, affordable, and reliable services to customers.