Rebuilding CMS’s Central Data Abstraction Tool to Fight Fraud at Scale

Payment integrity sounds clinical until you look at the numbers.

When diagnosis codes are unsupported, miscoded, or deliberately escalated, they distort risk scores — and risk scores directly drive Medicare Advantage payment. At national scale, even small percentage shifts translate into billions of dollars.

That’s why I’m proud of the work I led redeveloping CMS’s Central Data Abstraction Tool (CDAT), the backbone of the Risk Adjustment Data Validation (RADV) process. CDAT collects, standardizes, and manages medical record submissions and abstractions so CMS can validate whether risk-adjustment diagnoses submitted by Medicare Advantage organizations are truly supported in source documentation.

Official CMS CDAT overview:
https://security.cms.gov/pia/central-data-abstraction-tool-modernized

The Scale of the Problem

To understand why CDAT modernization matters, you have to understand the magnitude of improper payments in Medicare Advantage.

According to CMS’s FY 2025 Improper Payments Fact Sheet, Medicare Part C (Medicare Advantage) had an estimated 6.09% improper payment rate — totaling approximately $23.67 billion. CMS notes that the majority of these improper payments are tied to insufficient documentation supporting submitted diagnoses.
https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2025-improper-payments-fact-sheet

CMS has further stated that Medicare Advantage plans may overbill the government by approximately $17 billion annually, and cited MedPAC estimates suggesting the number could be as high as $43 billion per year.
https://www.cms.gov/newsroom/press-releases/cms-rolls-out-aggressive-strategy-enhance-and-accelerate-medicare-advantage-audits

To be clear: improper payments are not automatically fraud. Many are documentation errors. But unsupported diagnosis coding — particularly systematic HCC escalation — creates payment distortions at enormous scale.

CMS’s RADV Fast Facts note that completed audits for Payment Years 2011–2013 identified 5–8% overpayments in audited plans.
https://www.cms.gov/files/document/cpi-radvfact-sheet.pdf

Those percentages, applied across a program exceeding $400B annually, represent significant taxpayer exposure.
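
As a back-of-the-envelope illustration, the snippet below applies the 5-8% audited-plan range to a round $400B figure. Both inputs come from the text above; treating $400B as the program size and assuming audited-plan rates generalize to the whole program are simplifications for illustration, not CMS estimates.

```python
# Illustrative arithmetic only; not a CMS estimate.
PROGRAM_SPEND_BILLIONS = 400.0    # "exceeding $400B" in the text, used as a round floor
OVERPAYMENT_RATES = (0.05, 0.08)  # 5-8% range reported for audited plans

def exposure_billions(spend: float, rate: float) -> float:
    """Dollar exposure (in billions) if `rate` applied to all of `spend`."""
    return spend * rate

low, high = (exposure_billions(PROGRAM_SPEND_BILLIONS, r) for r in OVERPAYMENT_RATES)
print(f"Implied exposure: ${low:.0f}B to ${high:.0f}B per year")
```

Under those assumptions the implied range is roughly $20B to $32B per year, which is the same order of magnitude as the improper payment estimates cited above.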

What RADV Has Recovered — And What It Could Recover

CMS has acknowledged that meaningful RADV recoveries have historically been limited; the last significant recovery was tied to Payment Year 2007 audits.
https://www.cms.gov/newsroom/press-releases/cms-rolls-out-aggressive-strategy-enhance-and-accelerate-medicare-advantage-audits

However, CMS’s own projections show what is at stake when documentation analysis is done rigorously:

In 2023, CMS finalized a controversial RADV rule that strengthened its extrapolation authority and removed the fee-for-service (FFS) adjuster. CMS projected the rule could yield approximately $4.7 billion in recoveries over 10 years (2023–2032).
https://www.sidley.com/en/insights/newsupdates/2023/02/cms-issues-highly-controversial-final-medicare-advantage-audit-rule

Those numbers underscore why scalable audit infrastructure matters.

What We Modernized in CDAT

Redeveloping CDAT wasn’t about interface improvements — it was enterprise modernization in a domain where every workflow must be secure, defensible, and auditable.

We redesigned the system to support:

  • Secure ingestion and storage of medical records and supporting documentation

  • Structured abstraction workflows aligned with RADV methodology

  • Full lifecycle tracking — intake, review, reconsideration, and error calculation

  • Scalable architecture capable of handling increasing audit volume
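
To make the lifecycle-tracking idea concrete, here is a minimal sketch of a stage machine with an append-only audit trail. It is not CDAT's actual design: the `Stage` names mirror the list above, but the transition rules in `ALLOWED`, the `RecordCase` class, and the actor and case identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Stage(Enum):
    INTAKE = "intake"
    REVIEW = "review"
    RECONSIDERATION = "reconsideration"
    ERROR_CALCULATION = "error_calculation"

# Assumed transition rules; the real RADV workflow may differ.
ALLOWED = {
    Stage.INTAKE: {Stage.REVIEW},
    Stage.REVIEW: {Stage.RECONSIDERATION, Stage.ERROR_CALCULATION},
    Stage.RECONSIDERATION: {Stage.ERROR_CALCULATION},
    Stage.ERROR_CALCULATION: set(),
}

@dataclass
class RecordCase:
    case_id: str
    stage: Stage = Stage.INTAKE
    history: list = field(default_factory=list)  # append-only audit trail

    def advance(self, new_stage: Stage, actor: str) -> None:
        """Move to new_stage if permitted, recording who did it and when."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage.value} -> {new_stage.value} is not allowed")
        self.history.append((datetime.now(timezone.utc), actor, self.stage, new_stage))
        self.stage = new_stage

case = RecordCase("case-001")
case.advance(Stage.REVIEW, actor="reviewer_1")
case.advance(Stage.ERROR_CALCULATION, actor="reviewer_1")
print(case.stage.value, len(case.history))  # prints: error_calculation 2
```

Rejecting disallowed transitions and logging every permitted one is what makes each case's path reconstructable after the fact, which is the property an auditable workflow needs.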

NewWave’s award coverage described CDAT as automating and streamlining RADV record flow operations:
https://newwave.io/newwave-awarded-recompete-to-provide-cms-with-cdat/

The modernization laid the foundation for analytics at scale.

Where AI — Especially NLP — Changes the Game

The core RADV challenge is that the evidence supporting (or not supporting) diagnosis codes lives inside unstructured medical records: physician notes, discharge summaries, scanned documents.

Reviewing those records manually at national scale is slow and inconsistent.

That’s where Natural Language Processing (NLP) and machine learning become transformative.

NLP can:

  • Extract clinical concepts from free-text documentation

  • Compare extracted evidence to submitted HCC / ICD codes

  • Identify unsupported or contradictory diagnoses

  • Detect systemic upcoding patterns across contracts
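
The first three capabilities can be sketched with a toy example. The snippet below uses keyword patterns and a naive negation check to decide which submitted codes a note supports. It is a deliberately simple stand-in: a production system would use a clinical NLP model and a full terminology, and the `CONCEPT_PATTERNS` lookup, the negation rule, and the sample note are all assumptions made for illustration.

```python
import re

# Hypothetical phrase-to-ICD-10 lookup; three patterns stand in for a real terminology.
CONCEPT_PATTERNS = {
    "E11.9": re.compile(r"\btype 2 diabetes\b", re.IGNORECASE),
    "I50.9": re.compile(r"\bheart failure\b", re.IGNORECASE),
    "J44.9": re.compile(r"\bCOPD\b", re.IGNORECASE),
}
# Naive negation cue: one of these words appears earlier in the same sentence.
NEGATION = re.compile(r"\b(no|denies|without)\b[^.]*$", re.IGNORECASE)

def supported_codes(note: str) -> set[str]:
    """Codes whose concept is mentioned in the note and not negated."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        for code, pattern in CONCEPT_PATTERNS.items():
            match = pattern.search(sentence)
            if match and not NEGATION.search(sentence[:match.start()]):
                found.add(code)
    return found

def unsupported(submitted: set[str], note: str) -> set[str]:
    """Submitted codes with no supporting mention in the note."""
    return submitted - supported_codes(note)

note = "Patient with type 2 diabetes, well controlled. No evidence of heart failure."
print(sorted(unsupported({"E11.9", "I50.9"}, note)))  # prints: ['I50.9']
```

Even this toy version shows why negation handling matters: a plain keyword search would treat "No evidence of heart failure" as support for the heart failure code.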

Vendor material and published research suggest NLP can improve risk adjustment validation and coding accuracy:

AI & NLP in Risk Adjustment (IQVIA):
https://www.iqvia.com/-/media/iqvia/pdfs/library/fact-sheets/iqvia-natural-language-processing-risk-adjustment-solution.pdf

Automated Medical Coding Research (NIH / PMC):
https://pmc.ncbi.nlm.nih.gov/articles/PMC11835781/

Fraud Detection via Machine Learning (NIH / PMC):
https://pmc.ncbi.nlm.nih.gov/articles/PMC10173919/

Fraud detection in healthcare is fundamentally an anomaly detection problem: rare events, imbalanced datasets, shifting behaviors. Machine learning is well suited to surfacing subtle patterns that human reviewers miss.
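
One simple instance of anomaly detection is an unsupervised outlier screen. The sketch below scores each contract's unsupported-diagnosis rate with a modified z-score (median and median absolute deviation), which stays stable when the outliers themselves are rare. The contract names and rates are invented, and the 3.5 cutoff is a commonly used rule of thumb rather than anything CMS prescribes.

```python
from statistics import median

def modified_z_scores(values: list[float]) -> list[float]:
    """Median/MAD-based z-scores, which are robust to a few extreme values."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical share of audited diagnoses found unsupported, per contract.
rates = {
    "contract_A": 0.04,
    "contract_B": 0.05,
    "contract_C": 0.06,
    "contract_D": 0.05,
    "contract_E": 0.21,
}

scores = dict(zip(rates, modified_z_scores(list(rates.values()))))
THRESHOLD = 3.5  # rule-of-thumb cutoff for the modified z-score
flagged = [name for name, z in scores.items() if abs(z) > THRESHOLD]
print(flagged)  # prints: ['contract_E']
```

A flag like this is a reason to look closer, not a finding; it only prioritizes which contracts get human review.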

Catching HCC Escalation at Enterprise Scale

Medicare Advantage fraud, waste, and abuse often isn’t obvious fake claims — it’s systematic HCC code escalation and unsupported diagnoses that increase risk scores and therefore payment rates.

Policy commentary has repeatedly raised concerns about risk score inflation and coding intensity:
https://assets.arnoldventures.org/pdf-previews/AV-Comment-Letter-on-2026-MA-Advance-Notice.pdf

Detecting that at enterprise scale requires:

  • Petabyte-scale data pipelines

  • Statistical modeling of diagnosis distribution shifts

  • NLP over medical record documentation

  • Repeatable, auditable abstraction workflows
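
As one small example of modeling a diagnosis distribution shift, the sketch below runs a two-proportion z-test on the prevalence of a single diagnosis category across two payment years. The counts are hypothetical, and this is only one of many possible tests. A significant shift can also reflect real changes in the enrolled population, so it is a screening signal rather than evidence of improper coding.

```python
import math

def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: enrollees carrying one diagnosis category in two payment years.
z, p = two_proportion_z(hits_a=1200, n_a=10_000, hits_b=1500, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With these made-up counts (12% versus 15% prevalence), the shift is far larger than sampling noise would explain, which is the kind of pattern that would justify pulling records for abstraction.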

That’s where my expertise intersects directly with CDAT modernization — building AI-ready infrastructure capable of supporting defensible, large-scale payment error analysis.

The Bottom Line

Improper Medicare Advantage payments are measured in tens of billions annually. Individual audit recoveries can reach hundreds of millions. CMS projects strengthened RADV enforcement could recover $4.7B over a decade.

Modernizing CDAT was about giving CMS the scalable infrastructure necessary to:

  • Measure payment error accurately

  • Identify systemic unsupported coding

  • Target investigations intelligently

  • Support defensible recovery actions

This is what modern payment integrity looks like: secure infrastructure paired with AI, NLP, and statistical rigor — applied at national scale to protect taxpayer dollars.
