AI Security Wire

Membership Inference Attacks: How Much Training Data Can Be Recovered

A new academic study has produced some of the most comprehensive measurements to date of training data leakage from large foundation models, finding that membership inference attacks — techniques to determine whether a specific sample was used in training — are substantially more effective against large models than previously reported benchmarks suggested. The findings have significant implications for organisations training models on data that contains personal information.

Background: Membership Inference Attacks

A membership inference attack (MIA) attempts to answer a binary question: was this specific data point in the model’s training set? The ability to answer this question has privacy implications because:

  1. Knowing that a document was in a training set reveals that the organisation had access to that document at training time
  2. Combined with memorisation, it can be used as a first step in targeted training data extraction
  3. Under GDPR’s right to erasure, if a data subject requests deletion, organisations must be able to determine whether their data was used in training — and MIAs reveal that this determination is also possible from the outside

New Results

The research team evaluated membership inference attacks against a range of open-weight foundation models (7B to 70B parameters) using both existing attack methods and a novel attack methodology they term Calibrated Likelihood Ratio (CLR) attacks.

Attack Performance

| Attack Type | Model Size | AUC | True Positive Rate @ 0.1% FPR |
|---|---|---|---|
| Loss-based (baseline) | 7B | 0.61 | 8.2% |
| Min-k% (prior SOTA) | 7B | 0.67 | 12.1% |
| CLR (new) | 7B | 0.74 | 21.4% |
| Loss-based (baseline) | 70B | 0.64 | 9.8% |
| Min-k% (prior SOTA) | 70B | 0.72 | 18.6% |
| CLR (new) | 70B | 0.81 | 31.2% |

The CLR attack achieves substantially higher true positive rates at very low false positive rates, the regime that matters most for practical privacy attacks. At a 0.1% false positive rate, the new method correctly identifies 31.2% of training members for the 70B model: roughly three times the rate of the loss-based baseline (9.8%), and about 300 times the 0.1% that random guessing would achieve at that threshold.
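The "true positive rate at a fixed false positive rate" metric can be made concrete with a short sketch. This is not the study's evaluation code; the function name `tpr_at_fpr` and the convention that a higher score means "more likely a member" are assumptions made for illustration.

```python
import math


def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """Return (threshold, tpr, fpr) for the loosest threshold whose
    false positive rate does not exceed target_fpr.

    Assumed convention: a sample is flagged as a training member
    when its score is strictly greater than the threshold.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    # How many non-members may be wrongly flagged within the FPR budget.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        threshold = float("-inf")
    else:
        # Only non-members scoring strictly above this value are flagged,
        # so at most `allowed` false positives can occur.
        threshold = ranked[allowed]
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    fpr = sum(s > threshold for s in nonmember_scores) / len(nonmember_scores)
    return threshold, tpr, fpr


if __name__ == "__main__":
    members = [0.9, 0.8, 0.7, 0.2]
    nonmembers = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.85]
    print(tpr_at_fpr(members, nonmembers, target_fpr=0.1))
```

Measuring at 0.1% FPR needs a large held-out non-member set: with fewer than 1,000 non-members, the budget allows no false positives at all, and the estimate becomes very noisy.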

Why Larger Models Leak More

A consistent finding across the study is that larger models are more vulnerable to membership inference. The proposed explanation:

Large models have greater capacity and therefore memorise training data more extensively. During training, they can effectively store verbatim or near-verbatim copies of training examples in their weights. This memorisation is what MIAs detect — they measure whether the model has “overfit” to a specific example.
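As an illustration of how an attack turns this "overfit to a specific example" signal into a number, here is a minimal sketch of the two existing scores named in the results: a loss-based score and a Min-k% score. It assumes per-token log-probabilities have already been obtained from the target model. The function names, the default `k`, and the example values are hypothetical, and the article does not describe the CLR attack in enough detail to sketch it.

```python
def loss_score(token_logprobs):
    """Loss-based score: the mean token log-probability, which is the
    negative of the average per-token loss. Higher (closer to zero)
    suggests the model has seen the text."""
    return sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs, k=0.2):
    """Min-k% style score: the mean log-probability of the k fraction
    of tokens the model found least likely. Memorised text tends to
    have few very surprising tokens, so this mean stays high."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


if __name__ == "__main__":
    # Hypothetical log-probabilities for two five-token sequences.
    memorised = [-0.1, -0.2, -0.3, -0.1, -0.2]
    unseen = [-0.1, -0.2, -3.0, -0.1, -5.0]
    for name, lp in (("memorised", memorised), ("unseen", unseen)):
        print(name, loss_score(lp), min_k_score(lp, k=0.4))
```

Either score is then compared against a threshold chosen on known non-members; the threshold, not the score itself, determines the false positive rate.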

The researchers also find that memorisation is non-uniform across training data: sequences that are duplicated in the training corpus (appearing multiple times) are memorised at dramatically higher rates than unique sequences. For a 70B model, training examples that appear 5+ times in the training set are correctly identified as members at a true positive rate of over 50% at 0.1% FPR.
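Because duplication drives memorisation, a first-pass check is to count how often each sequence appears and keep only one copy. The sketch below handles exact duplicates after light normalisation (case and whitespace); the normalisation rule and function names are assumptions, and near-duplicate detection would need a fuzzier technique that is not shown here.

```python
import hashlib
import re
from collections import Counter


def normalise(text):
    """Collapse whitespace and lowercase so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text):
    """Stable content hash of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def duplicate_counts(docs):
    """Map each fingerprint to the number of times it occurs."""
    return Counter(fingerprint(d) for d in docs)


def deduplicate(docs):
    """Keep the first occurrence of each distinct document, in order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = fingerprint(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["Hello  World", "hello world ", "Other text"]
    print(deduplicate(corpus))
    print(max(duplicate_counts(corpus).values()))
```

The counts are useful on their own: sequences with high counts are the ones the study found most exposed, so they are natural candidates for a memorisation audit.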

Data Extraction

Building on membership inference, the team also evaluated direct training data extraction — generating text from the model and checking whether it matches verbatim training examples. At scale, they were able to extract:

  • Personal names and email addresses from documents that appeared in the training corpus
  • Portions of copyrighted text
  • Identifiable user-generated content scraped from public sources

The extraction rate was low in absolute terms (roughly 0.01% of training data was directly extractable) but significant given the volume of text in large training corpora: 0.01% of a 1T token training set is still ~100M extractable tokens of potentially sensitive content.
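The "generate text and check whether it matches verbatim training examples" step can be sketched as an n-gram lookup. This is a simplified illustration rather than the study's pipeline: it assumes whitespace tokenisation and a training corpus small enough to index in memory, and the window length `n` is a parameter chosen by the auditor.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs, n):
    """Index every n-gram that occurs in the training documents."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc.split(), n)
    return index


def verbatim_matches(generated, index, n):
    """Return the n-grams of a generated sample found verbatim in training."""
    return ngrams(generated.split(), n) & index


if __name__ == "__main__":
    training = ["contact alice at alice@example.com for the report"]
    index = build_index(training, n=4)
    sample = "please contact alice at alice@example.com today"
    print(verbatim_matches(sample, index, n=4))
```

A short window flags common phrases as matches, so a real audit would likely use a much longer window and review flagged spans for personal data.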

Implications for GDPR Compliance

The right to erasure (Article 17 GDPR) requires organisations to delete an individual’s personal data upon request. For models trained on personal data, this creates a problem: it may not be sufficient to delete the training data if the model has memorised it.

The UK Information Commissioner’s Office (ICO) has not yet issued definitive guidance on whether model weights constitute “personal data” under the GDPR. However, this research strengthens the argument that in some cases they do: if training data can be recovered from model weights, then the weights contain personal data in a meaningful sense.

Practical implications:

  • Organisations training models on user-generated data should conduct memorisation audits before model deployment to assess the risk of PII recovery
  • Data deduplication is not just a training efficiency measure — removing duplicates from training data significantly reduces memorisation and therefore MIA risk
  • Models trained on high-risk PII (medical records, financial data) should have differential privacy applied during training, even at a cost to model utility
  • Consider machine unlearning research as a long-term solution for GDPR-compliant model modification

Differential Privacy: Current State

The standard technical mitigation against membership inference is differentially private (DP) training. However, the research highlights a practical limitation: at privacy budgets (ε values) that provide meaningful protection against the CLR attack, model utility suffers significantly:

| DP Budget (ε) | MIA AUC (CLR) | MMLU Score (70B model) |
|---|---|---|
| No DP | 0.81 | 78.4% |
| ε = 10 | 0.74 | 76.1% |
| ε = 3 | 0.65 | 71.2% |
| ε = 1 | 0.57 | 62.8% |

At ε = 1 (considered strong privacy), the model loses 15.6 MMLU percentage points (78.4% to 62.8%), a substantial capability penalty, while the CLR attack's AUC falls from 0.81 to 0.57, close to the 0.5 of random guessing. The community’s challenge is to develop training procedures that achieve meaningful privacy protection at a lower utility cost.
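The article does not say which DP training algorithm was used. A common choice is DP-SGD, and the sketch below assumes it: each example's gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise scaled to that norm is added, and the result is averaged. The function names and parameters are illustrative, and translating a noise multiplier into an ε value requires a privacy accountant that is not shown.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng=None):
    """One DP-SGD style gradient estimate for a batch.

    Clipping bounds any single example's influence; the Gaussian noise
    (std = noise_multiplier * max_norm) masks what influence remains.
    """
    rng = rng or random.Random()
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
    print(dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                          rng=random.Random(0)))
```

The utility cost in the table follows from this mechanism: a smaller ε means more noise relative to the clipped signal, so each update carries less information about the data, including the rare examples an attacker would target.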

Recommendations

For organisations training or fine-tuning models on sensitive data:

  1. Deduplicate training data — remove duplicate and near-duplicate sequences before training; this is the single most effective intervention for reducing memorisation risk.
  2. Audit for PII before training — use automated PII detection to identify and remove sensitive personal information from training corpora.
  3. Apply differential privacy for high-risk data — for models trained on genuinely sensitive personal data (healthcare, financial), accept the utility cost and apply DP training.
  4. Conduct memorisation audits — before deploying models trained on proprietary or personal data, run extraction benchmarks to characterise memorisation risk.
  5. Monitor for extraction attempts — log and monitor model API queries for patterns consistent with training data extraction (repetitive prompts, systematic probing of low-temperature completions).
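As a starting point for recommendation 2, the sketch below finds and redacts email addresses with a regular expression. It is deliberately minimal: the pattern is an approximation rather than a full address grammar, the `[EMAIL]` placeholder is an arbitrary choice, and real PII detection would also need to cover names, phone numbers, and other identifiers.

```python
import re

# Approximate email pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace each email-like substring with a placeholder token."""
    return EMAIL_RE.sub(placeholder, text)


if __name__ == "__main__":
    doc = "Contact alice@example.com or bob.smith@mail.example.org."
    print(find_emails(doc))
    print(redact_emails(doc))
```

Running the same detector over sampled model outputs, not only over the training corpus, doubles as a lightweight memorisation check.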