Consumer brands have terabytes of customer data that exists in diverse formats, within many different disconnected systems across the organization. Because brands increasingly compete on customer experience, using this data for more personalized, targeted, and seamless marketing and operations is vital to their success.
But before brands can use their data, records within and across data sources must be connected through a process called identity resolution. Identity resolution techniques range from purely mathematical to machine learning-based, with the latter resulting in better match rates and higher accuracy and precision. More effective identity resolution results in richer customer profiles and fuller use of the data.
In this post we will outline the unique challenges associated with resolving customer identities for marketing and operational use cases, and the specific methods that result in the best outcomes.
Challenge 1: Issues with Unique Identifiers
When a unique identifier is available, accurate, consistent, and stable across all customer databases, we can use simple SQL statements to accurately and efficiently link records. However, these four criteria are very seldom all satisfied when dealing with customer databases. This is due, in part, to the fact that companies are constantly restructuring, adding, or swapping out systems, resulting in a lack of unique identifiers across data sources.
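For the ideal case, where a shared identifier does exist, a plain join is enough. The sketch below uses Python's built-in sqlite3 module; the table names (loyalty, bookings), the member_id column, and the sample rows are hypothetical and only illustrate the idea.

```python
import sqlite3

# Hypothetical tables, columns, and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loyalty (member_id TEXT PRIMARY KEY, full_name TEXT);
    CREATE TABLE bookings (booking_id INTEGER, member_id TEXT, flight TEXT);
    INSERT INTO loyalty VALUES ('M1', 'Ana Diaz'), ('M2', 'Raj Patel');
    INSERT INTO bookings VALUES
        (1, 'M1', 'SEA-SFO'),
        (2, 'M2', 'SEA-JFK'),
        (3, NULL, 'SEA-LAX');
""")


def link_bookings(connection):
    """Link each booking to a member through the shared identifier."""
    query = """
        SELECT b.booking_id, l.full_name
        FROM bookings AS b
        JOIN loyalty AS l ON b.member_id = l.member_id
        ORDER BY b.booking_id
    """
    return connection.execute(query).fetchall()


linked = link_bookings(conn)
print(linked)
```

Booking 3 has no member_id, so the join silently drops it. That is the kind of gap the cases that follow describe.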
Real life further complicates the situation. Consider the following characteristic scenarios:
Case 1: Unique identifiers are available but incomplete. In an airline’s booking database, the loyalty membership number serves as the unique identifier for the customer database. However, only a fraction of booking records have the unique identifier because not all the travelers are loyalty members and the members may not accurately provide their loyalty membership number for every trip.
Case 2: Unique identifiers are available but inconsistent. Two companies merge and wish to consolidate their customer databases into one. However, the databases use different unique identifier systems that are not compatible with one another.
Case 3: Unique identifiers are available but duplicated. When incentivized by promotions, customers will sign up for multiple accounts. In this case, the same customer can have multiple unique identifiers. This situation can also occur due to name variations, address changes, or failures to log into the website.
Case 4: Unique identifiers are unavailable. A winery hosts a wine tasting event and asks the visitors to provide their contact information. To identify a visitor, only common personal attributes like names, addresses, and emails are available. The quality of self-provided information can be low, and some personal details can be wrong or incomplete, making it even more challenging to connect these records with other databases.
The best strategy for connecting customer records with or without unique identifiers is with a machine learning-based approach. Algorithms learn from the data using available identifiers and other customer information, and generate predictive models that can assign likelihoods to linkages when data is missing or inconsistent. More on this in the next section.
Challenge 2: Lack of Ground-Truth
To discern the likelihood of whether a pair of records matches or not, we need to train the predictive model with ground-truth data that specifies whether two records correspond to the same real-world person or not. This type of machine learning that uses labeled ground-truth data is called supervised machine learning. The quality and quantity of the ground-truth data are critical to the performance of the predictive model.
However, in customer databases, this type of ground-truth is usually unknown and unknowable. Perfect ground-truth would require contacting individual customers and asking them to confirm their personal details, which is prohibitively expensive and invasive, especially for larger brands. Manually labeling the data can help somewhat, but this approach is time-consuming, error-prone, and impossible at scale.
The solution to this problem involves synthesizing labeled ground-truth data, with an emphasis on business-specific logic. There are two parts to this approach:
When a unique identifier is available, the unique identifier is used as an intrinsic label to join records.
When a unique identifier is unavailable, rules are employed to classify candidate record pairs into matches and non-matches. Records that cannot be covered by the rules are left for manual labeling. This technique not only significantly reduces the labeling workload but also provides the business owner an opportunity to inject business logic into the decision-making process.
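The two-part approach above can be sketched as a small labeling function. This is a minimal illustration, not a description of any particular product's rules: the field names (customer_id, email, last_name, postal_code) and the rules themselves are assumptions chosen for the example, and a real deployment would substitute its own business logic.

```python
def normalize(value):
    """Lowercase and collapse whitespace; missing values become ''."""
    return " ".join(value.lower().split()) if value else ""


def same_field(a, b, field):
    """True only when both records have the field and the values agree."""
    left, right = normalize(a.get(field)), normalize(b.get(field))
    return bool(left) and left == right


def label_pair(a, b):
    """Return 'match', 'non-match', or 'manual' for a candidate pair."""
    # Part 1: a shared unique identifier acts as an intrinsic label.
    if a.get("customer_id") and a.get("customer_id") == b.get("customer_id"):
        return "match"
    # Part 2: hypothetical business rules for pairs without a shared identifier.
    same_email = same_field(a, b, "email")
    same_last = same_field(a, b, "last_name")
    same_zip = same_field(a, b, "postal_code")
    if same_email and same_last:
        return "match"
    if not (same_email or same_last or same_zip):
        return "non-match"
    # Anything the rules do not cover is left for manual labeling.
    return "manual"


ana = {"customer_id": "C1", "email": "ana@example.com",
       "last_name": "Diaz", "postal_code": "98101"}
ana_web = {"customer_id": None, "email": " Ana@Example.com",
           "last_name": "DIAZ", "postal_code": "98052"}
raj = {"email": "raj@example.com", "last_name": "Patel", "postal_code": "10001"}
unclear = {"email": "", "last_name": "Diaz", "postal_code": "98101"}

labels = [label_pair(ana, other) for other in (ana_web, raj, unclear)]
print(labels)
```

Only the third pair falls through to manual review, which is how rules of this kind reduce the labeling workload.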
Challenge 3: Computational Complexity for Big Tables
When assessing the likelihood of matches within a single table, each record potentially needs to be compared with all other records in that table. And when assessing the likelihood of matches across two, three, or more tables, each record from one table potentially needs to be compared to all records in all other tables. This requires massive computational speed and power.
The computational complexity of identity resolution grows quadratically as the size of databases grows (see the figure below). For example, suppose you have two tables, Table A and Table B, and each contains 20 million records. To compare all the records in Table A with all the records in Table B, we need to make 20 million x 20 million = 0.4 quadrillion (one quadrillion is 10^15) comparisons.
To enable identity resolution at this scale, a distributed data infrastructure is required, one that can expand and contract based on fluctuating requirements for speed and scale. It is also necessary to employ advanced indexing and blocking techniques. These filter out record pairs that are very unlikely to match while leaving candidate pairs for more detailed comparison and classification, ultimately bringing down the computational load.
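To make the effect of blocking concrete, here is a minimal single-machine sketch. The blocking key (first letter of the last name plus postal code), the field names, and the sample records are all hypothetical; production systems typically use several keys and a distributed engine.

```python
from collections import defaultdict

# Exhaustive comparison count for two tables of 20 million records each.
naive_comparisons = 20_000_000 * 20_000_000  # 4 * 10**14, or 0.4 quadrillion


def blocking_key(record):
    """Hypothetical key: first letter of last name plus postal code."""
    return (record["last_name"][:1].lower(), record["postal_code"])


def candidate_pairs(table_a, table_b):
    """Pair up only those records that share a blocking key."""
    blocks = defaultdict(list)
    for rec_b in table_b:
        blocks[blocking_key(rec_b)].append(rec_b)
    pairs = []
    for rec_a in table_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            pairs.append((rec_a["id"], rec_b["id"]))
    return pairs


table_a = [
    {"id": "a1", "last_name": "Diaz", "postal_code": "98101"},
    {"id": "a2", "last_name": "Patel", "postal_code": "10001"},
    {"id": "a3", "last_name": "Nguyen", "postal_code": "60601"},
]
table_b = [
    {"id": "b1", "last_name": "Dias", "postal_code": "98101"},
    {"id": "b2", "last_name": "Patel", "postal_code": "10001"},
    {"id": "b3", "last_name": "Smith", "postal_code": "60601"},
]

pairs = candidate_pairs(table_a, table_b)
full_comparisons = len(table_a) * len(table_b)
print(f"{len(pairs)} candidate pairs instead of {full_comparisons}")
```

The trade-off is that a true match whose key differs, for example a customer who moved to a new postal code, never becomes a candidate pair, so the choice of blocking keys matters.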
Challenge 4: Balancing the Probability of Correct and Incorrect Matches With Uncertain Data
Because of the inherently uncertain nature of customer data, the way that identities are resolved must correspond to how the data will be used. For example, for the types of use cases the U.S. Census Bureau is driving, identity resolution leverages mathematically based techniques such as binary classification. This results in a higher number of false negatives and few-to-no false positives, which is appropriate for how that organization will use its data.
For marketing use cases, however, a more complex and flexibly probabilistic approach is necessary to get the most value from the data. This allows brands to pivot connections in different ways, prioritizing recall, precision, or somewhere in between, depending on the use case.
For example, in targeted social media advertising, the costs associated with incorrect matches are small, and tolerating more of them also yields a greater number of correct matches, so the rewards are significant. By optimizing for more matches (both correct and incorrect), many more consumers will see ads that are relevant to them as individuals, driving revenue. The incorrectly matched records result in consumers seeing some ads that are less relevant to them, with little harm incurred.
However, for direct customer communications, especially ones involving transactional information, incorrect matches not only spoil the user experience but can also create legal complications. For these use cases, brands must err on the side of caution, accepting fewer correct matches in exchange for zero incorrect matches.
Finally, for personalized marketing through email, direct mail, site, and mobile app, brands will want to strike a balance between correct and incorrect match numbers, in order to optimize the value of the data.
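One common way to express this trade-off is to keep a match likelihood for every candidate pair and apply a different acceptance threshold per use case. The sketch below assumes exactly that; the threshold values, use-case names, and scores are invented for illustration and would need to be tuned against real data.

```python
# Hypothetical thresholds: lower favors recall, higher favors precision.
THRESHOLDS = {
    "social_ads": 0.50,
    "personalized_marketing": 0.80,
    "transactional": 0.99,
}


def accepted_links(scored_pairs, use_case):
    """Keep the pairs whose match likelihood meets the use case's threshold."""
    threshold = THRESHOLDS[use_case]
    return [pair for pair, score in scored_pairs if score >= threshold]


# Made-up likelihoods, as a trained model might assign to candidate pairs.
scored_pairs = [
    (("a1", "b1"), 0.995),
    (("a2", "b7"), 0.86),
    (("a3", "b4"), 0.62),
    (("a5", "b9"), 0.31),
]

link_counts = {
    use_case: len(accepted_links(scored_pairs, use_case))
    for use_case in THRESHOLDS
}
print(link_counts)
```

The same scored pairs serve all three use cases; only the threshold changes. A high threshold makes incorrect matches rarer but does not by itself guarantee that none occur.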
A Platform Designed for Marketing-Optimized Identity Resolution
Considering the challenges and the value associated with usable customer data, Amperity built a platform that was designed from the ground up for marketing-optimized identity resolution.
The Amperity Intelligent Customer Data Platform:
Uses machine learning to match records, improving match rates and accuracy over time, even when unique identifiers are incomplete, inconsistent, or unavailable
Rapidly matches terabytes of data across trillions of records, leveraging a scalable, distributed data infrastructure
Links data probabilistically and flexibly so brands can control for incorrect matches based on the particular use case
If your team wants to get started using a scalable, probabilistic, machine learning-driven approach to identity resolution, schedule a demo today.