Poor data quality costs the average enterprise $12.9 million per year, according to Gartner. That number gets cited constantly, but it's almost always presented as an abstraction, a line item nobody can trace to a specific team or budget. So where does the money actually go?
A lot of it goes to your data science team. Not to the models they were hired to build or the insights they were hired to surface, but to the hours they spend cleaning, deduplicating, reformatting, and reconciling data before they can do any of that. The cost of poor data quality isn't hiding in some invisible corner of the enterprise. It's sitting in your headcount budget, compounding every pay period.
Where data science budgets actually go
The claim that data scientists spend 80% of their time cleaning data has circulated for over a decade. The actual numbers are more nuanced, and more useful.
Anaconda's State of Data Science survey found that data scientists spend roughly 45% of their time on data preparation tasks, with cleaning and organizing alone accounting for over 26% of the average workday. (This benchmark comes from the 2020 edition of the survey, the most recent year Anaconda published specific time-allocation percentages; the survey has since shifted its focus to AI adoption trends.) Earlier surveys from CrowdFlower (now Figure Eight) reported higher figures: 60% on cleaning and organizing, reaching 80% when data collection and labeling were included. Kaggle's survey put time spent specifically on cleaning at around 15%, though that used a narrower definition excluding collection and loading.
The honest range runs from roughly 15% to 80%, depending on what you count. But even the conservative end of that range gets expensive fast when you look at who's doing the work.
Average U.S. data scientist salaries sit between $118,000 and $152,000, depending on the source. Senior and specialized roles push past $165,000. For a team of 10, the math looks like this:
At an average salary of $135,000 per data scientist:
45% of time on data prep (Anaconda's figure): $607,500/year spent on preparation work
60% of time on data prep (CrowdFlower's figure): $810,000/year
80% of time on data prep (upper bound): $1,080,000/year
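The arithmetic behind those three figures is simple enough to sketch. A minimal calculation, assuming the 10-person team and $135,000 average salary stated above:

```python
# Annual cost of data-prep time for a hypothetical 10-person team,
# using the prep-time shares from the surveys cited above.
TEAM_SIZE = 10
AVG_SALARY = 135_000

prep_shares = {
    "Anaconda (45%)": 0.45,
    "CrowdFlower (60%)": 0.60,
    "Upper bound (80%)": 0.80,
}

for label, share in prep_shares.items():
    annual_cost = TEAM_SIZE * AVG_SALARY * share
    print(f"{label}: ${annual_cost:,.0f}/year on data prep")
# Anaconda (45%): $607,500/year on data prep
# CrowdFlower (60%): $810,000/year on data prep
# Upper bound (80%): $1,080,000/year on data prep
```

Swap in your own team size, salary, and prep share to get your organization's number.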
Even the conservative estimate means over $600,000 a year in data science compensation going to work that doesn't require a graduate degree, a Kaggle ranking, or expertise in neural networks. It requires someone who can write SQL joins, maintain brittle Python scripts, and spot date-format inconsistencies across a dozen source systems. That code accumulates its own technical debt: ongoing maintenance, troubleshooting, and debugging that compounds the time sink quarter over quarter. That's not a data science problem. It's a resource allocation problem.
The compounding cost most leaders miss
Salary waste is the visible cost. The compounding costs are harder to measure but often larger.
Opportunity cost is the big one. Every sprint your team spends reconciling customer records across 10 systems is a sprint they didn't spend building a churn prediction model, a lifetime value segmentation, or a recommendation engine. If your data scientists are spending 45% to 60% of their time on prep, reclaiming even half of that effectively doubles or triples their capacity for revenue-generating analysis, without adding a single headcount. Those unbuilt models represent revenue that competitors with cleaner data foundations are already capturing. A 2025 IBM Institute for Business Value report found that 43% of chief operations officers now identify data quality as their most significant data priority, and over 25% of organizations estimate they lose more than $5 million annually from data quality issues alone.
Then there's attrition. Data scientists who spend years cleaning records instead of building models leave. They don't leave quietly, and they don't leave cheaply. Standard talent acquisition benchmarks put the cost of replacing a technical hire at 50% to 200% of annual salary. For a senior data scientist at $165,000, that's $82,000 to $330,000 per departure in recruiting, onboarding, and lost productivity. A PhD-holding data scientist quitting because they spent three years removing duplicates isn't a thought experiment. It's a pattern data leaders recognize immediately.
The third cost is trust erosion. When a data science team can't deliver insights on time because they're buried in prep work, the business stops asking. Executives make gut decisions instead of data-informed ones. The data function gets sidelined, and the next budget cycle becomes a fight for survival rather than expansion. Thomas Redman, writing in MIT Sloan Management Review, estimates that poor data quality costs most companies 15% to 25% of revenue, based on research from Experian and independent consultants. That revenue loss doesn't show up as a data science line item. It shows up as missed targets across marketing, sales, and operations, with the data team absorbing the blame.
Why customer data is the worst offender
Not all data quality problems are created equal. Customer data is disproportionately messy, and it's where the cost of poor data quality concentrates most heavily for business-to-consumer (B2C) enterprises.
The reason is structural. Customer records span more systems than almost any other data type: customer relationship management (CRM), point of sale (POS), email platforms, loyalty programs, web analytics, mobile apps, support tickets, and paid media. A mid-size B2C brand might ingest customer data from 10 to 15 sources. An enterprise retailer or financial services company could pull from 30 or more. Each source has its own schema, its own formatting conventions, and its own definition of what a "customer" is.
The costliest data quality problem in that mix is customer identity. It's expensive to do well and even more expensive when it's done poorly. The same person appears as five different records: "John Smith" in the CRM, "john_smith" in the loyalty system, "J. Smith" in the email platform, "SMITH, JOHN" in the POS, and "jsmith@email.com" in the support database. They're all the same customer, but to every downstream system, they're five separate people. Date formats ship in seven variations. Phone numbers arrive as strings, integers, and occasionally scientific notation. Revenue fields mean different things in different systems.
Data scientists on B2C teams don't spend their prep time on exotic analytical challenges. They spend it on identity resolution: matching, merging, and deduplicating customer records across sources so that the data is usable for anything at all. Churn models, segmentation, personalization, lifetime value (LTV) analysis: none of it works until someone answers the question "who is this customer, and how many records do they actually have?" Enterprise-scale identity resolution is a well-proven category of infrastructure. The manual approach persists not because solutions don't exist, but because organizations haven't invested in automating it.
Paying a data scientist $150,000 a year to deduplicate customer records is like hiring a structural engineer to lay bricks.
What solving this at the foundation actually looks like
The fix for customer identity fragmentation isn't hiring more data scientists or buying better data prep tools. It's resolving identity centrally, across the organization, so that every team models and reports from the same foundation. When that baseline exists, it creates a consistent layer of accuracy that every downstream system, analyst, and model can trust before a data scientist ever touches the data.
Automated Identity Resolution, combining deterministic rules with probabilistic and machine-learning-based matching, can resolve customer records across sources by evaluating name variations, email patterns, phone numbers, mailing addresses, and behavioral signals in combination. It handles the "John Smith vs. SMITH, JOHN vs. jsmith@email.com" problem at scale, continuously, without a data scientist writing custom SQL, maintaining fragile matching scripts, or manually deduplicating records. Format standardization, deduplication, and cross-source merging happen as part of ingestion, not as a manual post-processing step.
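To make the deterministic/probabilistic distinction concrete, here is a deliberately toy sketch (not Amperity's actual algorithm, and far simpler than production identity resolution): a deterministic rule fires on identical normalized emails, with a fuzzy name score as a probabilistic fallback.

```python
from difflib import SequenceMatcher

def normalize_email(email):
    return email.strip().lower()

def name_similarity(a, b):
    # Crude probabilistic signal: normalized string similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    # Deterministic rule: identical normalized emails are the same person.
    if rec_a.get("email") and rec_b.get("email"):
        if normalize_email(rec_a["email"]) == normalize_email(rec_b["email"]):
            return True
    # Probabilistic fallback: fuzzy name similarity above a threshold.
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold

crm = {"name": "John Smith", "email": "jsmith@email.com"}
pos = {"name": "SMITH, JOHN", "email": None}
support = {"name": "J. Smith", "email": "JSmith@Email.com"}

print(is_match(crm, support))  # True: email matches after normalization
print(is_match(crm, pos))      # False: "SMITH, JOHN" fails raw string similarity
```

Note that the CRM and POS records fail to match here: raw string similarity can't see that "SMITH, JOHN" is a reordered "John Smith." That's exactly why production systems evaluate many signals in combination (name parsing, address, phone, behavior) rather than any single score.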
When a data science team starts with a clean, deduplicated customer foundation, the ratio inverts. Instead of spending 45% to 80% of their time on prep, they spend it on the work they were hired for: modeling, analysis, and activation. The churn model gets built. The segmentation ships. The LTV analysis informs next quarter's budget. A team that reclaims even half of its prep time can realistically produce three to five times the analytical output, not because the people are working harder, but because they're finally working on the right problems.
Amperity's Customer Data Cloud is built around this exact problem. Identity Resolution is the core of the platform: it ingests raw customer data from any source, resolves identities using multi-patented, AI-powered deterministic and probabilistic matching, and delivers trusted, contextual customer profiles that downstream teams and systems can act on. It's not a data cleaning tool. It's the infrastructure layer that makes manual cleaning unnecessary for the single largest category of prep work in B2C data science.
If the numbers in this post look familiar, if you recognize the salary math or the attrition pattern or the unbuilt models, it may be worth seeing what your customer data actually looks like under the hood. Request a customer data audit and find out how much of your data science budget is going toward work that your infrastructure should be handling.
Data quality FAQs
How much time do data scientists spend cleaning data?
Industry surveys report a range depending on what activities are included. Anaconda's State of Data Science survey found that data scientists spend approximately 45% of their time on data preparation, with cleaning alone at 26% or more. CrowdFlower (now Figure Eight) surveys from 2015 to 2017 reported 60% on cleaning and organizing, reaching 80% when data collection and labeling were included. Kaggle's 2018 survey found roughly 15% of project time on cleaning specifically. The consensus: data preparation is the single largest time commitment in most data science roles, though the exact percentage varies by organization and methodology.
What does poor data quality cost an enterprise?
Gartner estimates an average of $12.9 million per year per organization. Thomas Redman, writing in MIT Sloan Management Review (2017), estimates that poor data quality costs most companies 15% to 25% of revenue, based on research from Experian and independent consultants. A 2025 IBM Institute for Business Value report found that over 25% of organizations estimate annual losses exceeding $5 million from data quality issues alone.
What is customer Identity Resolution?
Identity Resolution is an automated approach to matching and merging customer records across multiple data sources, using deterministic rules and probabilistic algorithms to create accurate, contextual profiles of each person. It addresses the duplicate records, inconsistent name formats, and fragmented profiles that accumulate when customer data spans many systems (CRM, email, POS, loyalty, web analytics, and others). For B2C enterprises, Identity Resolution removes the largest single category of manual data preparation work by automating the matching that data scientists would otherwise do by hand.
Can automation replace manual data cleaning for customer data?
For structured, repeatable problems like customer identity resolution, yes. Probabilistic and deterministic matching algorithms handle name variations, address standardization, and cross-source deduplication at scale without human intervention. Amperity's Identity Resolution uses multi-patented, AI-powered matching to do this continuously as new data arrives. General-purpose data cleaning (handling one-off schema changes, interpreting ambiguous business logic, validating domain-specific data) still requires human judgment. But automating identity resolution removes the single largest category of manual work for most B2C data teams.
How does Identity Resolution differ from manual data deduplication?
Manual deduplication typically relies on exact-match rules (same email, same phone number) applied one source at a time. Identity Resolution evaluates multiple signals in combination, including name variations, address patterns, behavioral data, and transitive connections between records, to identify matches that exact-match logic misses. It runs continuously as new data arrives, so profiles stay current without requiring repeated manual effort. For organizations with customer data across 10 or more systems, the difference between the two approaches can be hundreds of hours of data science time per quarter.
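The "transitive connections" point is worth making concrete: if record A matches B on phone number and B matches C on email, all three collapse into one profile even though A and C share no field directly. A minimal union-find sketch over hypothetical pairwise matches (illustrative only, not any vendor's implementation):

```python
def resolve_profiles(records, pairwise_matches):
    """Collapse records into profiles via transitive match links (union-find)."""
    parent = {r: r for r in records}

    def find(x):
        # Walk to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairwise_matches:
        union(a, b)

    # Group records by their root: each group is one resolved profile.
    profiles = {}
    for r in records:
        profiles.setdefault(find(r), []).append(r)
    return list(profiles.values())

records = ["crm:John Smith", "loyalty:john_smith", "email:jsmith@email.com"]
# Hypothetical pairwise matches: CRM<->loyalty on phone, loyalty<->email on login.
matches = [("crm:John Smith", "loyalty:john_smith"),
           ("loyalty:john_smith", "email:jsmith@email.com")]
print(resolve_profiles(records, matches))  # one profile containing all three
```

Exact-match logic applied one source at a time never links the CRM and email records; the transitive chain through the loyalty record is what merges them.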
What should data leaders look for in an Identity Resolution solution?
Focus on four areas. First, matching methodology: does the solution combine deterministic (exact-match) and probabilistic (algorithmic) approaches, or rely on just one? Second, transparency: can your team see how and why records were matched, or is the logic opaque? Third, adaptiveness: does the system update profiles as new data arrives, or require periodic reruns? Fourth, architecture fit: does the solution work with your existing data warehouse or lakehouse, or require migrating data into a separate environment? Solutions that cover all four reduce both the data prep burden and the ongoing maintenance cost.
How do you calculate the ROI of improving data quality?
Start with the direct cost: take your data team's total compensation and multiply by the percentage of time spent on preparation work (use 45% as a conservative baseline). That's your annual spend on prep. Then estimate the opportunity cost: what's a realistic revenue impact if your team shipped two to three more models per year? Factor in attrition savings if you're losing technical talent to data prep frustration (replacement cost runs 50% to 200% of salary per departure). Compare those combined costs against the investment in automated infrastructure. For most B2C enterprises with 10 or more customer data sources, the payback period is well under 12 months.
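That back-of-envelope can be written out as a short calculation. The revenue upside and investment figures below are placeholders to show the shape of the math, not benchmarks:

```python
def data_quality_roi(team_comp, prep_share, revenue_upside,
                     departures, avg_salary, replacement_rate, investment):
    """Annual benefit of automating data prep vs. the infrastructure cost."""
    prep_spend = team_comp * prep_share                         # direct cost
    attrition_savings = departures * avg_salary * replacement_rate
    total_benefit = prep_spend + revenue_upside + attrition_savings
    payback_months = 12 * investment / total_benefit
    return total_benefit, payback_months

benefit, payback = data_quality_roi(
    team_comp=1_350_000,     # 10 data scientists at $135k
    prep_share=0.45,         # conservative Anaconda baseline
    revenue_upside=500_000,  # placeholder: value of 2-3 extra models shipped
    departures=1,            # one prep-driven departure avoided per year
    avg_salary=135_000,
    replacement_rate=0.5,    # low end of the 50%-200% replacement range
    investment=400_000,      # placeholder annual infrastructure cost
)
print(f"${benefit:,.0f}/year benefit, payback in {payback:.1f} months")
```

Even with the conservative inputs above, the payback lands in months, not years; your own inputs will move the result, but the structure of the calculation stays the same.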
Why is customer data harder to clean than other enterprise data?
Customer data spans more source systems than nearly any other data type, and each system captures identity differently. A single customer might appear as five or more distinct records across CRM, POS, email, loyalty, and web analytics, each with different name formats, contact details, and identifiers. Unlike financial or inventory data (which typically has standardized schemas and clear primary keys), customer data lacks a universal identifier. That fragmentation makes deduplication and matching orders of magnitude more complex, and it's why identity resolution specifically, not general-purpose data cleaning, is the highest-impact automation investment for B2C data teams.
