Welcome to our blog series on decoding identity resolution. This nine-part series attempts to offer a friendly, comprehensive view of how to think about identity resolution, and of how to interpret the way different companies across the tech landscape represent it in their marketing and sales materials. The other articles in the series can be found here:
Deterministic (or “rules-based”) identity resolution is the most commonly used method for identity resolution. In this entry, we’ll explore what it involves and how to think about it.
Deterministic identity resolution
When we say “deterministic,” we mean that two records match only when their values are exactly the same, and that the rules are simple and minimal. The results prioritize predictability over accuracy. This is very important for operational use cases like associating a person with their payments, but ultimately insufficient for most other use cases.
In general, if an application or platform does not offer great detail about its identity resolution features, it usually means it’s using something deterministic.
Rules-based identity resolution is the most straightforward way of solving an incredibly complex problem. The vast majority of providers are interested in creating marketer tools, analytics features, or workflow features but sidestep providing a robust identity resolution solution because it's “too hard.” These tools all offload the work onto their customers.
When you are looking at data management solutions, look for boxes in their architecture diagrams labeled “ETL,” which stands for “Extract-Transform-Load.” This is a strong indication that your technical teams will be on the hook for writing entirely custom jobs to make the data conform to the tool's requirements, rather than being able to input data in whatever format it naturally occurs and letting the tool sort it out. This is incredibly time-consuming, and results in a “garbage in, garbage out” problem.
Let’s take a look at a couple of rules-based solutions commonly seen in the market.
Choosing a unique identifier is a common element of rules-based identity resolution. It means that the application chooses one field, or a combination of fields, in the data and declares it the “unique identifier.”
Most tools specializing in email marketing use email addresses as unique identifiers. This means that when you write code against the tool or load data into it, it simply looks up profiles by email address, and if there’s a direct match, it attaches the new data to that person's profile.
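To make the idea concrete, here is a minimal sketch of email-as-unique-identifier matching. The `profiles` store and the `upsert_by_email` function are hypothetical names invented for illustration, not any vendor's API, and the sketch assumes records arrive as plain dictionaries.

```python
from typing import Optional

# Hypothetical in-memory profile store, keyed by the unique identifier (email).
profiles = {}


def upsert_by_email(record: dict) -> Optional[dict]:
    """Attach a record to the profile with exactly the same email, or create one."""
    email = record.get("email")
    if not email:
        return None  # no identifier, so this record cannot be resolved at all
    profile = profiles.setdefault(email, {"email": email, "events": []})
    profile["events"].append(record)
    return profile


upsert_by_email({"email": "pat@example.com", "action": "signup"})
upsert_by_email({"email": "pat@example.com", "action": "purchase"})
# Exact matching treats a different casing as a different person.
upsert_by_email({"email": "Pat@Example.com", "action": "refund"})
print(len(profiles))  # 2
```

The last call shows the weakness of exact matching: the same person ends up with two profiles because the email was typed with different capitalization.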
Another common technique is a cascade of different rules, which allows for more flexibility in matching and is commonly how in-house-built identity resolution algorithms handle the problem.
This also shows up in product demos as an easily configurable way to control how identity resolution works. The simplicity of the algorithm means that you can typically control the rules without knowing how to write any code and still get a predictable result.
For example, a simple rule set might be something like:
Look up an email address
If there’s no match, look up a combination of first name, last name, and street address
If there’s still no match, look up on last name and phone number
The perceived advantage is that these are predictable and teams can have a clear discussion on the rules.
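The three-step rule set above could be sketched roughly as follows. The field names (`email`, `first_name`, `street_address`, and so on) and the `find_match` helper are assumptions made for this example; a real system would query a database rather than loop over a list.

```python
from typing import Optional

# Each rule is a tuple of fields that must all match exactly; rules are tried in order.
RULES = [
    ("email",),
    ("first_name", "last_name", "street_address"),
    ("last_name", "phone"),
]


def find_match(record: dict, profiles: list) -> Optional[dict]:
    """Return the first profile matched by the earliest rule that applies."""
    for fields in RULES:
        # Skip a rule if the incoming record lacks any field it needs.
        if not all(record.get(f) for f in fields):
            continue
        for profile in profiles:
            if all(profile.get(f) == record[f] for f in fields):
                return profile
    return None


known = [{
    "id": 1, "email": "a@example.com", "first_name": "Ana", "last_name": "Diaz",
    "street_address": "1 Main St", "phone": "555-0100",
}]
incoming = {"email": "new@example.com", "last_name": "Diaz", "phone": "555-0100"}
print(find_match(incoming, known)["id"])  # 1, matched by the third rule
```

Because the rules are just an ordered list, reordering or editing them changes the behavior without touching the matching logic, which is what makes this approach easy to expose as configuration.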
Another legacy answer to this problem is using a “score table” that weights different pieces of PII and creates a series of rules. This is similar to the cascading rules concept but with more options for fine-tuning.
If you only have a first and last name for someone, that doesn’t count as knowing who they are. Even though my name, Caleb Benningfield, isn't exactly common, there are at least a handful of other people in the United States with the same name. The score table allows you to then assign points for each type of matching data and establish a threshold for a minimum amount of information required to confidently match people.
First names match - 1 point
Last names match - 1 point
Emails match - 4 points
Phone numbers match - 1 point
Addresses match - 2 points
With the above scores, you can set a threshold of five points, so that two records count as a match when they score five or more. That will give you results like the following:
First and last name only - 2 points, NO MATCH
Last name and email - 5 points, MATCH
First, last, phone number and address - 5 points, MATCH
First name, phone number - 2 points, NO MATCH
Then you can tune it to the preferences of your organization.
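As a rough sketch, the score table above fits in a few lines. The weights and the five-point threshold come from the example; the field names and the `score` and `is_match` functions are assumptions for illustration.

```python
# Points awarded when a field matches exactly, taken from the example table.
WEIGHTS = {"first_name": 1, "last_name": 1, "email": 4, "phone": 1, "address": 2}
THRESHOLD = 5


def score(a: dict, b: dict) -> int:
    """Sum the points for every field present in both records with equal values."""
    return sum(
        points
        for field, points in WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def is_match(a: dict, b: dict, threshold: int = THRESHOLD) -> bool:
    """Two records match when their score reaches the threshold."""
    return score(a, b) >= threshold


profile = {
    "first_name": "Caleb", "last_name": "Benningfield",
    "email": "caleb@example.com", "phone": "555-0100", "address": "1 Main St",
}
name_only = {"first_name": "Caleb", "last_name": "Benningfield"}
last_and_email = {"last_name": "Benningfield", "email": "caleb@example.com"}

print(score(profile, name_only), is_match(profile, name_only))            # 2 False
print(score(profile, last_and_email), is_match(profile, last_and_email))  # 5 True
```

Tuning then means editing `WEIGHTS` or `THRESHOLD`, for example raising the threshold to six so that name, phone number, and address alone no longer qualify.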
Optimizing and “Probabilistic”
Optimizing deterministic identity resolution
Some platforms make rules-based matching more robust by allowing for basic “string” matching algorithms (or text matching algorithms) that can account for the different ways people type in their information. This is often referred to as “fuzzy matching” and includes things like counting how many characters differ between two names and treating them as a match if the difference is no more than one or two characters.
One way to make this more effective is to run a process that standardizes the data first, which improves results by eliminating common anomalies but also adds processing time.
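One common way to count differing characters is Levenshtein edit distance, which is the number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below assumes that measure; platforms may use other string metrics, and the `fuzzy_equal` helper and its `max_edits` parameter are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_equal(a: str, b: str, max_edits: int = 1) -> bool:
    """Treat two values as the same if they differ by at most max_edits edits."""
    return edit_distance(a.lower(), b.lower()) <= max_edits


print(fuzzy_equal("Caleb", "Kaleb"))   # True: one substitution
print(fuzzy_equal("Jon", "John"))      # True: one insertion
print(fuzzy_equal("Caleb", "Calvin"))  # False: three edits apart
```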
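A minimal sketch of what such a standardization pass might look like is below. The three helper functions are hypothetical and deliberately simplistic; real pipelines also handle things like country codes, nicknames, and postal address formats.

```python
import re


def standardize_email(value: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()


def standardize_phone(value: str) -> str:
    """Keep digits only, dropping spaces, dashes, and parentheses."""
    return re.sub(r"\D", "", value)


def standardize_name(value: str) -> str:
    """Collapse runs of whitespace and ignore letter case."""
    return " ".join(value.split()).casefold()


print(standardize_email("  Pat@Example.COM "))        # pat@example.com
print(standardize_phone("(555) 010-0199"))            # 5550100199
print(standardize_name("  Caleb   BENNINGFIELD  "))   # caleb benningfield
```

Running every incoming value through helpers like these before matching means the exact-match rules see fewer spurious differences, at the cost of an extra processing step per record.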
Look out for any companies claiming “probabilistic” ID resolution: it most likely just means that some small element of probability has been introduced into an otherwise rules-based process.
The example of “fuzzy matching” to account for common variances in how data is entered means making some guesses, which technically introduces probability.
While it does add a layer to the process, it’s a minimal improvement framed to make a rules-based solution seem more sophisticated.
Tradeoffs of deterministic identity resolution
Every choice of tool or algorithm has its tradeoffs. Below is how to think about the upsides and downsides of a rules-based identity resolution algorithm.
Upsides:
Can be the fastest option if implemented correctly
Better for “operational” use cases where the risk of being wrong is high
Downsides:
Less accurate, which can lead to bad customer experiences, inaccurate analytics, etc.
Only faster if you choose the simplest rules with the right infrastructure
Next up we’ll be talking about the bleeding edge of identity resolution: advanced data science.