What is a membership inference attack and what does it reveal?

A membership inference attack (MIA) determines whether a specific data sample was used in a model's training set by measuring the model's 'surprise' (perplexity) at that sample: training data typically yields lower perplexity than unseen data. Successful attacks can reveal that an organisation had access to a document at training time and serve as a first step toward extracting that data verbatim.

Why are larger models more vulnerable to membership inference attacks?

Larger models have greater representational capacity and therefore memorise training data more extensively. They can store verbatim or near-verbatim copies of training examples in their weights, and this memorisation is what MIAs detect. The research also finds that duplicated training examples are memorised at dramatically higher rates than unique ones.

What does membership inference vulnerability mean for GDPR compliance?

Under GDPR's right to erasure (Article 17), deleting training data may be insufficient if the model has memorised it; the weights may constitute 'personal data' in a meaningful sense. Organisations training on personal data should conduct memorisation audits, deduplicate training corpora, and consider differential privacy training for high-risk datasets.

Membership Inference Attacks on LLMs

Your model weights are probably holding data you promised to delete. That’s the uncomfortable bottom line from a wave of membership inference research that’s been sharpening its methods for the past two years; the latest results are harder to dismiss than previous ones.

Membership inference attacks (MIAs) have existed for a while, but earlier benchmarks suggested their practical accuracy was modest enough that most teams filed them under “worth monitoring, not worth losing sleep over.” The new results from a comprehensive study across open-weight foundation models change that calculus, particularly for larger models handling sensitive data.

What a membership inference attack actually tells you

The core question is binary: was this specific data sample in the model’s training set? Simple question. The implications are not.

Knowing a document was in the training set reveals the organisation had access to it at training time. That’s a disclosure concern on its own. Combine it with extraction techniques and it becomes a first step toward recovering verbatim content from model weights. And under GDPR’s Article 17, if a data subject requests erasure, deleting the training file may not be enough if the model has memorised it, raising the question of whether weights themselves constitute personal data.

The ICO hasn’t issued definitive guidance on that last point. This research makes the argument harder to dismiss.

CLR attacks and why the numbers got worse

The research team compared existing attack methods against a new approach they call Calibrated Likelihood Ratio (CLR) attacks, tested across open-weight models from 7B to 70B parameters.
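To make the three attack families concrete, here is a minimal sketch of how each could turn per-token log-probabilities into a membership score. The function names are illustrative, and the third function is an assumption: it shows a generic likelihood ratio calibrated against a reference model, which is one common way to build such a score and not necessarily the exact CLR formulation used in the study.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood per token (the log of perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)

def loss_score(token_logprobs):
    """Loss-based score: lower loss gives a higher score (more member-like)."""
    return -mean_nll(token_logprobs)

def min_k_score(token_logprobs, k=0.2):
    """Min-k% style score: mean log-prob of the k fraction of least likely tokens."""
    n = max(1, round(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

def reference_calibrated_score(target_logprobs, reference_logprobs):
    """Assumed likelihood-ratio form: how much better the target model predicts
    the text than a reference model that did not train on it (per token)."""
    return mean_nll(reference_logprobs) - mean_nll(target_logprobs)

# Hypothetical per-token log-probs for one text under two models.
target = [-0.1, -0.2, -0.05, -3.0, -0.1]
reference = [-1.0, -1.5, -0.8, -3.2, -0.9]
print(loss_score(target), min_k_score(target),
      reference_calibrated_score(target, reference))
```

The calibration step matters because some text is simply easy to predict. Subtracting a reference model’s loss separates "this text is generic" from "this model has seen this text".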

Attack Performance

Attack Type	Model Size	AUC	TPR @ 0.1% FPR
Loss-based (baseline)	7B	0.61	8.2%
Min-k% (prior SOTA)	7B	0.67	12.1%
CLR (new)	7B	0.74	21.4%
Loss-based (baseline)	70B	0.64	9.8%
Min-k% (prior SOTA)	70B	0.72	18.6%
CLR (new)	70B	0.81	31.2%

The 0.1% false positive rate is the regime that matters: at very low false alarm rates, CLR correctly identifies 31.2% of training members for 70B models, roughly three times the rate of the loss-based baseline (9.8%) and far above the 0.1% a random guesser would achieve. That’s not theoretical noise. That’s operationally meaningful accuracy.
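For readers who want to reproduce this kind of metric on their own models, the following is a minimal sketch of computing AUC and true positive rate at a fixed false positive rate from attack scores. It assumes higher scores mean "more likely a member", and the function names and toy scores are illustrative rather than taken from the study.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """TPR when the threshold flags at most target_fpr of known non-members."""
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of false positives we tolerate (epsilon guards against float error).
    allowed = int(len(ranked) * target_fpr + 1e-9)
    # Flag only scores strictly above the (allowed + 1)-th highest non-member.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)

def auc(member_scores, nonmember_scores):
    """Probability a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))

# Toy scores; a real evaluation at 0.1% FPR needs thousands of non-members.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.85, 0.5, 0.35, 0.2, 0.1, 0.05, 0.0, -0.1, -0.2, -0.3]
print(auc(members, nonmembers), tpr_at_fpr(members, nonmembers, target_fpr=0.1))
```

AUC averages over every threshold, which is why it can look unremarkable while the low-FPR number is alarming. An attacker who only acts on high-confidence hits cares about the latter.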

Bigger models memorise more. Much more.

The study found a consistent pattern: larger models are more vulnerable, not less. The intuition is straightforward. A 70B model has far more representational capacity than a 7B model and can store near-verbatim copies of training examples in its weights, and that memorisation is what the MIA detects.

What makes this worse is that memorisation is not evenly distributed across training data. Duplicated sequences (text that appears multiple times in the training corpus) are memorised at dramatically higher rates than unique sequences. For a 70B model, examples that appear five or more times in training are recoverable at over 50% true positive rate at 0.1% FPR.

That’s a direct argument for deduplication, and it’s one of the few mitigations where the evidence is unambiguous.
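As a rough illustration of what deduplication involves, here is a minimal sketch that removes exact duplicates after light normalisation and reports how often each document appeared. It is an assumption about one reasonable starting point, not the procedure used in the study: it will not catch near-duplicates or repeated substrings inside longer documents, which need techniques such as MinHash or suffix-array matching.

```python
import hashlib
import re
from collections import Counter

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def fingerprint(text):
    """Stable hash of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Return (unique documents in first-seen order, fingerprint -> count)."""
    counts = Counter()
    unique = []
    for doc in documents:
        fp = fingerprint(doc)
        if counts[fp] == 0:  # first time this content is seen
            unique.append(doc)
        counts[fp] += 1
    return unique, counts

docs = ["Hello  world", "hello world", "Other text", "HELLO WORLD "]
unique, counts = deduplicate(docs)
print(unique)
print(max(counts.values()))
```

The counts are worth keeping even after the duplicates are gone: anything that appeared five or more times in the raw corpus is exactly the material the study flags as most exposed.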

From inference to extraction

The team went further than membership inference; they tested whether they could extract verbatim content from model outputs. The results:

Personal names and email addresses extracted from documents in the training corpus
Portions of copyrighted text recovered verbatim
Identifiable user-generated content from public data sources

Absolute extraction rates were low, roughly 0.01% of training data. But at scale, a 1 trillion token training corpus at 0.01% extraction yields approximately 100 million tokens of potentially sensitive content (10^12 × 0.0001 = 10^8). That’s not a rounding error.
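One common way to test for verbatim extraction is a prefix-continuation check: feed the model the start of a known training sample and see whether it reproduces the rest exactly. The sketch below assumes that setup; the article does not describe the study’s exact protocol, and `generate` is a hypothetical callable standing in for greedy decoding from whatever model is under test.

```python
def extraction_rate(samples, generate, prefix_len=50, suffix_len=50):
    """Fraction of samples whose true suffix the model reproduces verbatim.

    samples:  list of token lists drawn from the training corpus.
    generate: hypothetical callable (prefix_tokens, n_tokens) -> list of tokens.
    """
    tested = 0
    extracted = 0
    for tokens in samples:
        if len(tokens) < prefix_len + suffix_len:
            continue  # too short to split into a prefix and a suffix
        prefix = tokens[:prefix_len]
        suffix = tokens[prefix_len:prefix_len + suffix_len]
        tested += 1
        if generate(prefix, suffix_len) == suffix:
            extracted += 1
    return extracted / tested if tested else 0.0
```

Exact match is a conservative criterion. A model that reproduces a suffix with one token changed still leaks most of the content, so a real audit would usually add an approximate-match measure alongside it.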

The differential privacy trade-off, and why it’s awkward

The standard mitigation is differentially private training. The problem is that privacy and capability pull in opposite directions:

DP Budget (ε)	MIA AUC (CLR)	MMLU Score (70B model)
No DP	0.81	78.4%
ε = 10	0.74	76.1%
ε = 3	0.65	71.2%
ε = 1	0.57	62.8%

At ε = 1 (genuinely strong privacy protection) the model loses 15.6 MMLU percentage points, falling from 78.4% to 62.8%. Try selling that to the team that just spent three months fine-tuning on proprietary data for a specific task. Strong DP is technically sound and practically painful.

This is a real trade-off, not a solved problem. The research community is working on better training procedures, but nothing has closed the gap meaningfully yet.
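To show where the utility cost comes from, here is a minimal sketch of a single DP-SGD update in plain Python, assuming the standard recipe of per-example gradient clipping followed by Gaussian noise. The article does not say which DP training procedure the study used, so treat this as an illustration of the general mechanism. It also does not compute ε: translating a noise multiplier and number of steps into a privacy budget requires a privacy accountant, which is not shown.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(weights, per_example_grads, lr=0.1, max_norm=1.0,
                noise_multiplier=1.0, rng=random):
    """One update: clip each example's gradient, sum, add noise, average."""
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]

# Toy usage with two per-example gradients and a seeded generator.
new_weights = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]],
                          rng=random.Random(0))
print(new_weights)
```

Clipping bounds how much any single example can move the weights, and the noise hides what remains. A smaller ε means more noise relative to the signal, which is the source of the capability loss in the table.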

What to actually do

If you’re training or fine-tuning on sensitive data:

Deduplicate the training corpus first: this is the single highest-leverage intervention. Removing duplicates dramatically lowers memorisation rates and therefore MIA vulnerability. It’s also free from a capability standpoint.

PII audit before training, not after: run automated PII detection on the training corpus before a training run starts. That is much cheaper than dealing with the consequences post-deployment.

DP training for genuinely high-risk data: healthcare records, financial data, anything where extraction would constitute a regulatory incident. Accept the utility cost. The alternative is worse.

Memorisation audit before deployment: run extraction benchmarks against your model before releasing it. Know your exposure. Most teams skip this.

Monitor for extraction patterns in production: repetitive prompts, systematic low-temperature probing, unusual query patterns. Log and alert.
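As a starting point for the PII audit step, a pre-training scan can be as simple as the sketch below, which flags email-like strings in each document. The regular expression is a deliberately loose illustration and an assumption of this example. A production audit would cover more categories (names, phone numbers, identifiers) and would normally rely on a dedicated PII detection tool.

```python
import re

# Loose pattern for email-like strings; not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(documents):
    """Return {document index: [email-like strings]} for documents with matches."""
    hits = {}
    for i, doc in enumerate(documents):
        found = EMAIL_RE.findall(doc)
        if found:
            hits[i] = found
    return hits

docs = [
    "Contact alice@example.com for access.",
    "No PII here.",
    "bob.smith+test@mail.example.org, carol@example.co.uk",
]
print(find_emails(docs))
```

Flagged documents can then be redacted, dropped, or routed to the higher-risk pipeline that gets DP training.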

The right to erasure problem is not solved by any of this; if someone requests deletion after a model is already trained and deployed, options are limited to model unlearning research that’s still maturing. That’s a process and policy problem as much as a technical one. The time to manage it is before training, not after a deletion request arrives.
