Decoding Identity Resolution, Part Four: Advanced Data Science Identity Resolution

Welcome to our blog series on decoding identity resolution. This nine-part series aims to offer a friendly, comprehensive view of how to think about identity resolution, and of how to interpret the way different companies across the tech landscape represent it in their marketing and sales materials. The other articles in the series can be found here:

Part One: The Basics
Part Two: The Value of Identity Resolution
Part Three: Rules Based Identity Resolution
Part Four: Advanced Data Science Identity Resolution
Part Five: The Green Checkmark Effect
Part Six: Digital Identity Resolution
Part Seven: Demystifying Sales Narratives
Part Eight: The Amperity Perspective
Part Nine: Conclusions

Introduction

Have you ever done any of the following?

Moved from one home to another
Had more than one email address
Changed your name (due to marriage or another reason)
Used a nickname intermittently
Accidentally misspelled something when entering your info
Shipped something to a family member’s house or business address

If so, congratulations! You’re a person who does things a person does all the time!

Unfortunately, you are also likely a duplicate in someone’s data set.

This is the core problem an advanced identity resolution algorithm is focused on solving. All data management systems are basically tools to translate the way people interact with the world into something predictable and useful. As cloud computing has expanded, we have more advanced tools that can help us account for the very human ways we interact with the digital world.

Two similar but different terms are used to refer to this in the context of identity resolution: “machine learning” (ML) and “artificial intelligence” (AI).

Defining ML and AI for identity resolution

Machine learning algorithms are a combination of thousands of granular software rules that can be “trained.” You build the algorithm so that it can adapt based on whether its results were correct. Then you manually create a “training data set” by flagging correct and incorrect results thousands of times, and feed that data back into the algorithm so it learns to do better next time (by adding even more rules).
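To make the train-and-correct loop a little more concrete, here is a minimal sketch in Python. It uses logistic regression, which is one common way to learn from flagged examples and is not necessarily what any particular vendor uses. The feature names, the tiny training set, and the learning settings are all made up for illustration.

```python
import math

# Hypothetical training data: each row describes a PAIR of records.
# Features: [name_similarity, same_email, same_postal_code]
# Label: 1 = a human flagged the pair as the same person, 0 = different people.
TRAINING_PAIRS = [
    ([1.00, 1, 1], 1),
    ([0.90, 1, 0], 1),
    ([0.80, 0, 1], 1),
    ([0.95, 1, 1], 1),
    ([0.20, 0, 1], 0),
    ([0.10, 0, 0], 0),
    ([0.30, 0, 0], 0),
    ([0.00, 0, 1], 0),
]


def predict(weights, bias, features):
    """Return the estimated probability that two records are the same person."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=500, learning_rate=0.5):
    """Nudge the weights a little after every flagged example."""
    weights = [0.0] * len(pairs[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in pairs:
            # Positive error means the guess was too high, negative too low.
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(TRAINING_PAIRS)
likely_match = predict(weights, bias, [1.0, 1, 1])
unlikely_match = predict(weights, bias, [0.1, 0, 0])
```

After training, `likely_match` should land close to 1 and `unlikely_match` close to 0. A real system would use far more flagged examples and far more features than this toy version.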

Artificial intelligence is a vaguer term. In general, AI algorithms are programmed to have a nuanced understanding of humans and how they interact with data. In the data management and marketing space, it basically means codifying expertise into the software so that the end user doesn’t have to understand the minutiae of the problem to get value from the software.

You might think, “That sounds great, sign me up!” You’re right to be excited, but the difficult part is coming up. ML and AI concepts sit deep in computer science. Even if you have a team of developers, you may not have access to people who understand these concepts well enough to build them. Below we’ll give an overview, and in the next entry in this series we’ll arm you with some knowledge you can use to better navigate the data landscape and tell which tools are doing something legitimately interesting versus which tools are just latching onto market trends.

Advanced identity resolution

An advanced identity resolution algorithm will use ML or AI to help account for inconsistent or dirty data.

One way to do this is with a collection of “models.” A model is basically a rules-based algorithm with thousands upon thousands of granular and evolving rules. The result has far more intelligence built into it. A good advanced identity resolution tool will use a collection of elegantly applied models, analyze the results, and approximate “confidence” in the form of a score or weight.
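As a rough illustration of how several models might roll up into one confidence score, here is a minimal Python sketch. It uses a plain weighted average, which is only one possible approach. The `ModelSignal` class, the weights, and the thresholds are hypothetical and are not taken from any specific product.

```python
from dataclasses import dataclass


@dataclass
class ModelSignal:
    name: str      # which model produced the signal
    score: float   # 0.0 (no evidence of a match) to 1.0 (strong evidence)
    weight: float  # hypothetical importance of this model


def confidence(signals):
    """Combine per-model scores into one weighted-average confidence."""
    total_weight = sum(s.weight for s in signals)
    if total_weight == 0:
        return 0.0
    return sum(s.score * s.weight for s in signals) / total_weight


def decide(score, match_threshold=0.85, review_threshold=0.60):
    """Turn a confidence score into an action (thresholds are assumptions)."""
    if score >= match_threshold:
        return "match"
    if score >= review_threshold:
        return "review"
    return "no match"


signals = [
    ModelSignal("name", score=0.9, weight=3.0),
    ModelSignal("email", score=1.0, weight=5.0),
    ModelSignal("address", score=0.0, weight=2.0),
]
score = confidence(signals)
decision = decide(score)
```

Here the name and email models agree strongly while the address model disagrees, so the pair lands in a middle band instead of being merged outright. That is the practical benefit of a score over a yes-or-no rule.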

Some of the model types that an advanced identity resolution algorithm will use include:

Comparing data against a library of “truth” data sets to affect how much weight a match carries
- Example: A dataset that knows how common names are might make rare names worth more in a match
Using combinations of data to affect the weight of a match
- Example: Using the “name” dataset but indexing it by postal code, then intelligently determining how common a name is for a given locale whenever a record has both name and postal code.
Accounting for known bad data values
- Example: There are plenty of known bad values in personal data, such as business addresses or fake email addresses. An advanced algorithm can take into account bad data and remove it from the calculation.
Intelligently standardizing or cleaning data
- Example: An advanced algorithm intelligently standardizes data before comparing it, so that “206-555-5555” and “(206) 555-5555” are recognized as the same phone number even though a simple string match would fail. Different types of data benefit from different types of standardization.
Comparing different data types
- Example: If you know the first and last name, then a common email address would be “first.last@domain.com” or “f.last@domain.com”. You can add weight to the match by seeing that the names are present in the email address.
Handling nicknames
- Example: Being able to understand that Kimberly and Kim can refer to the same person. A rules-based algorithm won’t treat them as the same, but an advanced algorithm would be able to check names against their possible nicknames and count it as a match.
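Several of the examples above can be sketched in a few lines of Python. This is a simplified illustration and not a production implementation. The lookup tables (`NICKNAMES`, `NAME_FREQUENCY`, `KNOWN_BAD_EMAILS`) hold made-up values, the function names are hypothetical, and the phone handling assumes US-style numbers.

```python
import math
import re

# Hypothetical reference data; a real tool would use far larger libraries.
NICKNAMES = {"kim": "kimberly", "bill": "william", "liz": "elizabeth"}
NAME_FREQUENCY = {"smith": 0.0100, "zelenko": 0.0001}  # made-up population shares
KNOWN_BAD_EMAILS = {"test@test.com", "none@none.com"}


def normalize_phone(raw):
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def canonical_first_name(name):
    """Map a known nickname to its full form; leave other names unchanged."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def email_supports_name(email, first, last):
    """True if the email looks like first.last@ or f.last@ for this name."""
    if not first or not last:
        return False
    local = email.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    return local in {f"{first}.{last}", f"{first[0]}.{last}"}


def usable_email(email):
    """Return a cleaned email, or None when it is a known bad value."""
    cleaned = email.strip().lower()
    return None if cleaned in KNOWN_BAD_EMAILS else cleaned


def rarity_weight(last_name, default_frequency=0.001):
    """Rarer names get a larger weight (negative log of the frequency)."""
    frequency = NAME_FREQUENCY.get(last_name.lower(), default_frequency)
    return -math.log(frequency)


same_phone = normalize_phone("206-555-5555") == normalize_phone("(206) 555-5555")
same_name = canonical_first_name("Kim") == canonical_first_name("Kimberly")
```

Each function is tiny on its own. The difficulty the article describes comes from building and maintaining the reference data behind them, and from running these checks across every pair of records at scale.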

These are just a handful of examples, but all of these operations are complex. Implementing them with the “cascading rule sets” that legacy models use can push processing times to days, whereas advanced data science approaches can do it in minutes.

The costs of getting data wrong have a ripple effect. For a more detailed look at the impact of inaccurate identity resolution, check out part two of this series.

Tradeoffs of advanced data science identity resolution

Every choice of tool or algorithm has its tradeoffs; here are the pros and cons of an advanced data science identity resolution strategy.

Upsides	Downsides
More accurate	Limitation to how fast it can be processed
Will handle large-scale data	More expensive
Less configuration work to maximize benefits	Not a good choice for use cases where the risk of being “wrong” is high
Cleaner data
More future-proof

Next up

We’ll be talking about a huge risk that people encounter when shopping for data management platforms, which I call “The Green Checkmark Effect.” Click here to advance to part five of our series on decoding identity resolution.