Data Provenance of Data Stats

“Gartner reports that organisations lose $12.9M annually due to poor data quality.¹”

It’s a metric that gets plenty of airtime; I’ve referenced it myself on a few occasions, in fact. We see similar figures cited regularly: 83% of data migrations fail², or the primary cause of migration failure is a "lack of concern for field mapping."

I see these claims attributed to reputable organisations daily, yet they are rarely linked back to the original study. I’ve been guilty of this myself. Recently, while researching current trends in data governance, I hit that specific claim about "field mapping" and it immediately raised my hackles. In my experience, while it might not be the most glamorous part of the stack, it’s hardly ignored by the people actually doing the work.

Perhaps it’s my training as a historian, but I have a deep appreciation for primary sources. We worry a lot about data provenance in our pipelines, yet we seem less concerned with it in our professional discourse.

When assessing an argument, the strength and age of the source are everything:

  • The $12.9M Figure: This stems from Magic Quadrant for Data Quality Solutions, published July 27, 2020. This was a self-reporting survey of 157 enterprises conducted by Gartner. It’s useful context, but we have to acknowledge the limitations of self-reporting. It’s not peer-reviewed academic research—and it doesn’t need to be—but that distinction matters. In this case, the cohort consists of enterprises already sophisticated enough to be purchasing data quality software. That’s quite a selection bias when discussing the potential financial impact in a broader context.

  • The Migration Failure Rate: This specific "83%" metric is roughly 17 years old, coming from a 2009 report from Gartner. In technology terms, that’s ancient history. Citing it today without a massive asterisk is misleading; a 2009 (or older) data center migration and a 2026 cloud-native transition are entirely different beasts. Moreover, the actual stat is that 83% of data migration projects “fail or exceed their budgets and schedules.” That ‘or’ is doing a lot of heavy lifting: a project that shipped a few weeks late and 15% over budget is fundamentally different from one that was abandoned altogether.
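
To make the "age of the source" check concrete, here is a minimal Python sketch that records the two citations above and flags any that are older than a chosen cutoff. The names (CitedStat, age_in_years, needs_asterisk), the five-year cutoff, and the "today" date are all illustrative assumptions of mine, not an established standard; the publication dates come from the footnotes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CitedStat:
    """A statistic plus the minimum provenance needed to judge it."""
    claim: str
    source: str
    published: date
    method: str


def age_in_years(stat: CitedStat, today: date) -> int:
    """Whole years elapsed between publication and `today`."""
    years = today.year - stat.published.year
    # Subtract one if this year's anniversary has not happened yet.
    if (today.month, today.day) < (stat.published.month, stat.published.day):
        years -= 1
    return years


def needs_asterisk(stat: CitedStat, today: date, max_age_years: int = 5) -> bool:
    """True when the stat is older than the cutoff (an arbitrary assumption)."""
    return age_in_years(stat, today) > max_age_years


QUALITY_STAT = CitedStat(
    claim="Organisations lose $12.9M annually due to poor data quality",
    source="Gartner, Magic Quadrant for Data Quality Solutions",
    published=date(2020, 7, 27),
    method="self-reported survey of 157 enterprises",
)

MIGRATION_STAT = CitedStat(
    claim="83% of data migration projects fail or exceed budgets and schedules",
    source="Gartner, Risks and Challenges in Data Migrations and Conversions",
    published=date(2009, 2, 25),
    method="not stated in this post",
)

if __name__ == "__main__":
    today = date(2026, 1, 1)  # assumed "now" for illustration only
    for stat in (QUALITY_STAT, MIGRATION_STAT):
        flag = "needs an asterisk" if needs_asterisk(stat, today) else "ok"
        print(f"{stat.source}: {age_in_years(stat, today)} years old, {flag}")
```

The cutoff is the interesting design choice here: it is a judgment call that depends on how fast the field in question moves, which is why it is a parameter and not a constant.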

In both cases, savvy consumers need to see the source in order to assess its methodology and motivation. If I’m reading that Gartner data quality report as a CIO at an enterprise with a global footprint and a strong governance framework in place, I could be reasonably confident that the stat is in the right range for my company. But if I’m running a regional MNC with minimal controls, it could be way off for any number of reasons.

If the 83% figure reflects an era of poorly understood data migration challenges, and the industry has spent more than 15 years developing tooling, methodology, and hard-won institutional knowledge in direct response to those failures, then the baseline failure rate today is almost certainly lower. Citing that figure today to justify a significant remediation budget or a risk escalation to the board, for example, could mean over-engineering your governance response to a problem that mature teams have substantially addressed, or creating organisational anxiety around a risk profile that no longer reflects reality. That data migration carries real risk is not in dispute. Whether that risk profile resembles 2009's is another matter entirely. The problem is, we don’t seem to have a follow-up study to help guide our decision-making.

My own efforts to dig into these claims began with search engines. I started with targeted search queries (the statistic plus the attributed source) to tie the metrics to specific reports. Finding the original page for the Data Quality Solutions report was easy enough. However, when Gartner’s live site went cold on the migration failure rate, I had to rely on the Wayback Machine to find a snapshot of the original 2009 post that discussed their findings. I also used several different AI tools to surface original citations that manual searching missed, though I treated these as "leads" only, independently verifying every link before drawing a conclusion.
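
For anyone who wants to script the Wayback Machine step, the following is a minimal Python sketch using only the standard library. It assumes the Internet Archive's public availability endpoint and the JSON shape I understand it to return (an "archived_snapshots" object with a "closest" entry); check the current API documentation before relying on it. The function names are hypothetical, and this is not the exact process I followed, which was manual.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL; `timestamp` is an optional YYYYMMDD hint."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest archived URL from a response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


def find_snapshot(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the API over the network and return the closest snapshot URL."""
    with urlopen(build_query(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

A call such as find_snapshot("https://www.gartner.com/en/documents/897512", "20190817") would ask for the capture nearest to the date in footnote 2. Whatever comes back is still only a lead: open the snapshot and read it before citing it.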

This rabbit hole got me thinking about a fundamental trap in human reasoning: confirmation bias. We tend to accept information that supports our existing beliefs and only "interrogate the witness" when we disagree. Because I knew, instinctively, that field mapping is taken seriously by competent teams, I went looking for the source. If the stat had said "Architects spend too much time on documentation," I might have nodded and moved on without checking.

As the team at The Skeptics’ Guide to the Universe often points out, we should be most wary of information that perfectly aligns with our pre-existing worldviews.

This isn’t about calling anyone out. It’s a reminder—mostly to myself—to apply the same rigour to our industry "common knowledge" that we apply to the systems we build.

If we’re going to be data-driven, we should probably start with the data we use to justify our own professions.

What’s your process for verifying the statistics you cite — and do you have a cutoff date for how old a figure can be before it needs an asterisk?


And for the writers out there: how do you share your ‘works cited’ without turning a professional post into a dry, academic paper? For my part, I tend to save the full Chicago-style footnotes for my longer-form blog posts, but I’m curious how others strike that balance in a more casual context.


  1. Chien, Melody, and Ankush Jain. 2020. “Magic Quadrant for Data Quality Solutions.” Gartner, July 27, 2020. https://www.gartner.com/document/3982237.

  2. Friedman, Ted. 2009. “Risks and Challenges in Data Migrations and Conversions.” Gartner, February 25, 2009. https://web.archive.org/web/20190817105444/https://www.gartner.com/en/documents/897512.
