Amperity's Revolutionary Approach to Enterprise-Scale Matching

Matching is the heart of the identity resolution process. This is where we compare records and draw connections in three key phases: blocking, scoring, and clustering. Bear in mind that the Amperity platform is specifically optimized for enterprise-scale data: identity resolution for small datasets uses fundamentally different techniques than those applied to billions of records.

With that focus on large-scale consumer brands and their most pressing identity challenges in mind, let’s dive into our current three-phase approach.

Blocking

Identity resolution across hundreds of millions of records creates a massive computational challenge. Without optimization, comparing every record in a 20-million-record dataset against every other record would require approximately 0.4 quadrillion comparisons (20 million × 20 million), which would take weeks when brands need results in hours.

That's why Amperity uses blocking to make identity resolution at scale possible. This technique divides records into smaller groups whose members are more likely to match, so that only records within the same group are compared. This dramatically reduces both the number of comparisons and the computational load.
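To make the scale concrete, here is a rough back-of-the-envelope sketch of the pair arithmetic. The function names and the even split into 200,000 blocks of 100 records are illustrative assumptions, not Amperity figures. Note that the 0.4 quadrillion figure corresponds to roughly n × n; counting each unordered pair only once gives about half of that.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(block_sizes) -> int:
    """Pairs compared when records are only compared within their own block."""
    return sum(pair_count(size) for size in block_sizes)


n = 20_000_000

# Every record against every other record, counting (a, b) and (b, a) separately.
ordered_pairs = n * n            # 4 * 10**14, i.e. 0.4 quadrillion

# Counting each pair once.
all_pairs = pair_count(n)        # roughly 2 * 10**14

# Hypothetical blocking outcome: 200,000 equally sized blocks of 100 records.
# Real blocks are uneven, and a record may land in more than one block.
hypothetical_blocks = [100] * 200_000
blocked_pairs = blocked_pair_count(hypothetical_blocks)

print(f"all pairs:     {all_pairs:,}")
print(f"blocked pairs: {blocked_pairs:,}")
print(f"reduction:     {all_pairs / blocked_pairs:,.0f}x")
```

Under these assumed block sizes the workload drops from hundreds of trillions of comparisons to under a billion, which is the kind of reduction that turns weeks into hours.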

We are committed to innovating in this critical area, developing cutting-edge approaches like Dynamic Blocking to further improve both speed and accuracy across diverse data types and formats.

Data Science Deep Dive

During blocking, the complete dataset is divided into smaller blocks designated by “blocking keys”. A blocking key <a_i, f_i> is a tuple consisting of a semantic attribute a_i and a function f_i. The function f_i is usually very cheap to compute: it takes the attribute a_i as input and returns, for example, its phonetic encoding, a substring, or the value itself.

A possible blocking key is the concatenation of the first three characters of the given name and the first three characters of the surname, which can be represented as <FN, F3> + <LN, F3>. Amperity applies the blocking strategy to the entire unioned virtual table, which captures potential matches across tables and realizes the performance gain globally.
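The <FN, F3> + <LN, F3> key described above can be sketched in a few lines. This is a minimal illustration, not Amperity's implementation: the field names (`given_name`, `surname`, `id`), the sample records, and the choice to trim and lowercase values before taking the prefix are all assumptions.

```python
from collections import defaultdict


def first3(value: str) -> str:
    """F3: the first three characters of an attribute (normalized, by assumption)."""
    return value.strip().lower()[:3]


def blocking_key(record: dict) -> str:
    """<FN, F3> + <LN, F3>: given-name prefix concatenated with surname prefix."""
    return first3(record["given_name"]) + first3(record["surname"])


def build_blocks(records) -> dict:
    """Group record ids by blocking key; only ids sharing a key get compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    return dict(blocks)


# Hypothetical records for illustration only.
records = [
    {"id": 1, "given_name": "Jonathan", "surname": "Smithson"},
    {"id": 2, "given_name": "Jon", "surname": "Smith"},
    {"id": 3, "given_name": "Maria", "surname": "Garcia"},
]

blocks = build_blocks(records)
print(blocks)
```

Records 1 and 2 share the key `jonsmi` and end up in the same block, so they are candidates for pairwise scoring; record 3 is never compared against them.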

Pairwise Comparison and Scoring

Here, we compare each pair of records within a block with our patented high-precision ML model. Built solely to match customer records, Amperity’s model is equipped for the subtleties of different types of customer data, like:

Exact matches of a loyalty number
Fuzzy matches of email addresses based on Levenshtein distance
Probabilistic matches based on the rarity of names in a given region
Probabilistic matches based on email tokens that combine a first name, last name, or birthdate appearing elsewhere in the customer record
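To illustrate just one of the signals listed above, the sketch below computes the Levenshtein (edit) distance between two email addresses with the standard dynamic-programming method. It is a generic textbook implementation; how Amperity's model actually computes or weighs this signal is not described here, and the normalized `email_similarity` helper is a hypothetical convenience.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def email_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical emails, approaching 0.0 as they diverge."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("jsmith@example.com", "jsmyth@example.com"))
print(round(email_similarity("jsmith@example.com", "jsmyth@example.com"), 3))
```

A single-character typo yields a distance of 1 and a similarity close to 1.0, which is the kind of evidence a scoring model can combine with the other signals.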

The final output is one of six classifications for each record pair:

Exact Match: Records that irrefutably belong to the same individual.

Excellent Match: Very high confidence these records represent the same person.

Great Match: Strong indicators of a shared identity.

Good Match: Reasonable confidence in the connection.

Weak Match: Some indicators suggest these records belong to the same individual, but connections aren't strong.

Non-Match: High likelihood these records represent different people.

These classifications allow brands to make well-informed decisions about how to use their customer data.

Figure: A sample graph showing 28 different pair classifications

If a block has eight records in it, each record is compared against the other seven, and counting each pair once gives 8 × 7 / 2 = 28 classifications. Take a look under the hood for a better idea of how intelligent this model is:

Records 1 and 2 are an “exact match” because they share the same loyalty number and SSN
Records 1 and 3 are an “excellent match” because the email addresses are highly similar and there is a full match on a rare first name and a rare last name
Records 1 and 4 are a “non-match” because their SSNs are different
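The eight-record block described above can be enumerated directly. The sketch below only lists the pairs and records the three classifications given in the text; the remaining 25 labels are left unset rather than guessed, and the dictionary layout is an assumption made for illustration.

```python
from itertools import combinations

# Eight records in one block, identified here simply as 1 through 8.
record_ids = range(1, 9)

# Every unordered pair within the block: 8 * 7 / 2 of them.
pairs = list(combinations(record_ids, 2))

# Store one classification per pair. Only the three pairs discussed in the
# text are filled in; the others stay None because the text does not give them.
labels = {pair: None for pair in pairs}
labels[(1, 2)] = "exact"
labels[(1, 3)] = "excellent"
labels[(1, 4)] = "non-match"

scored = {pair: label for pair, label in labels.items() if label is not None}

print(len(pairs))
print(scored)
```

Keyed this way, the scored pairs form the edges of a graph whose nodes are records, which is the structure the clustering phase works from.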

Clustering

Clustering determines which records belong in each customer 360 profile. With our rich graph of connections, records can be organized according to your brand's specific goals.

We collaborate with each brand to establish an ideal matching threshold, or the minimum match quality required for records to be clustered together. Lower quality matches may still be included if there’s a transitive connection through higher quality matches.

Once a ‘high’ threshold is established and lower-quality matches are excluded, distinct customer profiles emerge. This approach resolves potential conflicts, such as keeping business travelers separate from leisure travelers, ensuring your customer profiles align with your business needs.

Figure: A graph depicting connections and their respective strengths

The sample data shows that record 1 is transitively connected to record 5 through their shared connection to record 2. Managing chains of connections like this across large datasets requires additional techniques. At this stage, the system groups the records into sets, called Amperity Clusters, that each represent one real-world customer and all their associated touchpoints.
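As a simplified illustration of thresholding plus transitive connections, the sketch below links records whose match quality meets a chosen threshold and then takes connected components using a union-find structure. This is a stand-in technique chosen for clarity, not a description of how Amperity Clusters are actually built; the numeric ranking of the six classifications and the sample pairs are assumptions.

```python
# Assumed ordering of the six classifications, weakest to strongest.
RANK = {"non-match": 0, "weak": 1, "good": 2, "great": 3, "excellent": 4, "exact": 5}


def cluster(record_ids, scored_pairs, threshold="great"):
    """Group records connected, directly or transitively, by matches at or above threshold."""
    parent = {record: record for record in record_ids}

    def find(record):
        # Walk up to the root, shortening the path along the way.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for a, b, label in scored_pairs:
        if RANK[label] >= RANK[threshold]:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for record in record_ids:
        clusters.setdefault(find(record), set()).add(record)
    return sorted(clusters.values(), key=min)


# Hypothetical scored pairs: 1 and 5 are only a weak match directly,
# but both connect strongly to record 2.
scored_pairs = [
    (1, 2, "exact"),
    (2, 5, "excellent"),
    (1, 5, "weak"),
    (3, 4, "good"),
    (1, 4, "non-match"),
]

print(cluster([1, 2, 3, 4, 5], scored_pairs, threshold="great"))
print(cluster([1, 2, 3, 4, 5], scored_pairs, threshold="good"))
```

With the threshold at "great", records 1, 2, and 5 form one cluster even though 1 and 5 are only weakly matched directly, while records 3 and 4 stay separate. Lowering the threshold to "good" merges 3 and 4, which shows how the threshold choice shapes the resulting profiles.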

Figure: A graph showing one customer and all of their related touchpoints

Amperity's AI-powered identity resolution transforms disparate data into a rich, trusted Customer 360 that drives best-in-class marketing, analytics, and customer experiences.

To learn more about unlocking and activating your data to drive real business value, check out our comprehensive identity resolution guide.