Welcome to our blog series on decoding identity resolution. This nine-part series offers a friendly, comprehensive view of how to think about identity resolution, as well as how to interpret the way different companies across the tech landscape represent it in their marketing and sales materials.
Have you ever done any of the following?
Moved from one home to another
Had more than one email address
Changed your name (due to marriage or another reason)
Used a nickname intermittently
Accidentally misspelled something when entering your info
Shipped something to a family member’s house or business address
If so, congratulations! You’re a person who does things a person does all the time!
Unfortunately, you are also likely a duplicate in someone’s data set.
This is the core problem an advanced identity resolution algorithm is focused on solving. All data management systems are basically tools to translate the way people interact with the world into something predictable and useful. As cloud computing has expanded, we have more advanced tools that can help us account for the very human ways we interact with the digital world.
Two related but distinct terms are used to describe these tools in the context of identity resolution: “machine learning” (ML) and “artificial intelligence” (AI).
Defining ML and AI for identity resolution
Machine learning algorithms can be thought of as a combination of thousands of granular software rules that can be “trained.” You build the algorithm so that it can adjust itself based on whether its output was correct, then manually create a “training data set” by flagging correct and incorrect results thousands of times. Finally, you feed that data back into the algorithm, which lets it learn how to do better next time (in effect, by refining and reweighting its rules).
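To make the training loop above a little more concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not how any particular vendor does it: the two features (name similarity and email equality), the six hand-labeled pairs, and every function name here are hypothetical, and a real training set would contain thousands of examples rather than six.

```python
import math
from difflib import SequenceMatcher


def rec(name, email):
    """Build a tiny hypothetical customer record."""
    return {"name": name, "email": email}


def features(a, b):
    """Turn a pair of records into two numeric features, each between 0 and 1."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_email = 1.0 if a["email"].lower() == b["email"].lower() else 0.0
    return [name_sim, same_email]


def predict(weights, bias, x):
    """Logistic model: returns the estimated probability that the pair is a match."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled_pairs, epochs=500, learning_rate=0.5):
    """Nudge the weights after every labeled example, based on how wrong the guess was."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for a, b, is_match in labeled_pairs:
            x = features(a, b)
            error = predict(weights, bias, x) - is_match
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hand-labeled "training data set": 1 = same person, 0 = different people.
LABELED_PAIRS = [
    (rec("Katherine Smith", "k.smith@example.com"), rec("Kathy Smith", "k.smith@example.com"), 1),
    (rec("Robert Jones", "rjones@example.com"), rec("Rob Jones", "RJones@example.com"), 1),
    (rec("Maria Garcia", "mg@example.com"), rec("Maria Garcia", "mg@example.com"), 1),
    (rec("Katherine Smith", "k.smith@example.com"), rec("Robert Jones", "rjones@example.com"), 0),
    (rec("Maria Garcia", "mg@example.com"), rec("Mario Garza", "mario@example.com"), 0),
    (rec("John Lee", "jl@example.com"), rec("Joan Lee", "joan@example.com"), 0),
]

WEIGHTS, BIAS = train(LABELED_PAIRS)


def match_probability(a, b):
    """Score a new, unlabeled pair with the trained model."""
    return predict(WEIGHTS, BIAS, features(a, b))
```

Calling `match_probability` on a new pair returns a number between 0 and 1. The point of the sketch is only that the behavior comes from the labeled examples rather than from rules someone typed in by hand.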
Artificial intelligence is a vaguer term. In general, it describes algorithms designed to capture a nuanced understanding of humans and how they interact with data. In the data management and marketing space it basically means codifying expertise into the software so that the end user doesn’t have to understand the minutiae of the problem to get value from the software.
You might think, “That sounds great, sign me up!” You’re right to be excited, but the difficult part is coming up. ML and AI are deep areas of computer science. Even if you have a team of developers, you may not have access to people who understand these concepts well enough to build with them. Below we’ll give an overview, and in the next entry in this series we’ll arm you with some knowledge you can use to better navigate the data landscape and tell which tools are doing something legitimately interesting versus which tools are just latching onto market trends.
Advanced identity resolution
An advanced identity resolution algorithm will use ML or AI to help account for inconsistent or dirty data.
One way to do this is with a collection of “models.” A model is basically a rules-based algorithm with thousands upon thousands of granular and evolving rules, so the result has far more intelligence built into it. A good advanced identity resolution tool will apply a collection of models elegantly, analyze the results, and approximate “confidence” in the form of a score or weight.
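As a rough sketch of what a “confidence” score can look like, the Python below combines per-field similarity into a single weighted number between 0 and 1. The field names, the weights in `FIELD_WEIGHTS`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration; a real tool would derive its weights from data and use many more signals.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to overall confidence.
FIELD_WEIGHTS = {"email": 0.5, "phone": 0.3, "name": 0.2}


def similarity(a, b):
    """Fuzzy similarity between two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def confidence(rec_a, rec_b):
    """Weighted average of field similarities, skipping fields missing on either side."""
    score = 0.0
    total_weight = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # no evidence either way, so the field is left out of the average
        score += weight * similarity(a, b)
        total_weight += weight
    return score / total_weight if total_weight else 0.0
```

A downstream process could then treat pairs above some threshold as the same person and route borderline scores to review. Where those thresholds sit is a business decision this sketch does not try to answer.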
Some of the model types that an advanced identity resolution algorithm will use include:
Comparing data against a library of “truth” data sets to affect how much weight a match carries
Using combinations of data to affect the weight of a match
Accounting for known bad data values
Intelligently standardizing or cleaning data
Example: An advanced algorithm intelligently standardizes data before comparing it, so that “206-555-5555” and “(206) 555-5555” are recognized as the same phone number, a correspondence that a simple string match would miss. Different types of data benefit from different types of standardization.
Comparing different data types
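The phone number example above, together with the idea of accounting for known bad values, can be sketched in a few lines of Python. This is a minimal illustration that assumes 10-digit North American numbers; the `KNOWN_BAD_PHONES` set and the function names are hypothetical placeholders, not taken from any specific product.

```python
import re

# Hypothetical placeholder values that people type in just to get past a required field.
KNOWN_BAD_PHONES = {"0000000000", "1111111111", "1234567890", "9999999999"}


def normalize_phone(raw):
    """Reduce a phone number to 10 digits, or return None if it is unusable."""
    digits = re.sub(r"\D", "", raw or "")  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip a leading country code of 1
    if len(digits) != 10 or digits in KNOWN_BAD_PHONES:
        return None
    return digits


def phones_match(a, b):
    """Compare two phone numbers after standardization; bad values never match."""
    norm_a, norm_b = normalize_phone(a), normalize_phone(b)
    return norm_a is not None and norm_a == norm_b
```

With this in place, `phones_match("206-555-5555", "(206) 555-5555")` is true even though the raw strings differ, while two records that both carry a placeholder such as “000-000-0000” are not treated as the same person.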
These are just a handful of examples, but all of these operations are complex. Implementing them with the “cascading rule sets” that legacy models use can push processing times to days, whereas advanced data science approaches can do the same work in minutes.
The costs of getting data wrong have a ripple effect. For a more detailed look at the impact of inaccurate identity resolution, check out part two of this series.
Tradeoffs of advanced data science identity resolution
Every choice of tool or algorithm has its tradeoffs; here are the pros and cons of an advanced data science identity resolution strategy.
Pros:
Will handle large-scale data
Less configuration work to maximize benefits
Cons:
Limitation to how fast it can be processed
Not a good choice for use cases where the risk of being “wrong” is high