How does model extraction differ from prompt injection or jailbreaking?

Prompt injection and jailbreaking attempt to manipulate the model's outputs within a session. Model extraction is a slower, volume-based attack: the adversary queries the API repeatedly with diverse inputs and uses the outputs to reconstruct the model's behaviour, system prompt, or decision boundaries. The goal is a stolen asset, not a manipulated response.

Can rate limiting alone stop model extraction attacks?

Rate limiting reduces extraction throughput but does not stop patient adversaries. An attacker willing to spread queries across days or weeks, rotating IPs and user agents, can stay under typical velocity thresholds while still completing an extraction. Traffic pattern analysis based on semantic diversity, rather than query velocity alone, catches attacks that rate limiting misses.

What is a practical starting point for extraction detection if we have no telemetry today?

Start with logging. Every query to your LLM API should generate a structured log record including: timestamp, source IP or authenticated user ID, approximate token count, and a hash or embedding of the input. With that data available, semantic clustering and diversity scoring can be applied retroactively during an incident investigation, and progressively in real-time as your monitoring infrastructure matures.
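
As a rough illustration of what such a record could look like, here is a minimal sketch using only the Python standard library. The field names, the `QueryLogRecord` class and the whitespace-based token estimate are all assumptions made for this example, not a standard schema.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class QueryLogRecord:
    # Hypothetical schema: field names are illustrative only.
    timestamp: float
    source_id: str        # source IP or authenticated user ID
    approx_tokens: int    # rough size estimate, not a tokenizer count
    input_sha256: str     # hash of the input, so raw text is not stored


def make_record(source_id: str, prompt: str, now: Optional[float] = None) -> QueryLogRecord:
    # Whitespace splitting is a crude stand-in for a real tokenizer.
    approx_tokens = len(prompt.split())
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return QueryLogRecord(
        timestamp=time.time() if now is None else now,
        source_id=source_id,
        approx_tokens=approx_tokens,
        input_sha256=digest,
    )


def to_json_line(record: QueryLogRecord) -> str:
    # One JSON object per line is easy to ship to most log pipelines.
    return json.dumps(asdict(record), sort_keys=True)
```

A hash only supports exact-duplicate detection. Diversity scoring needs an embedding stored alongside it or in place of it.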

API Traffic Analysis: Detecting Model Extraction Against Your LLMs

Model extraction has long been treated as a slow-burn intellectual property problem, the kind of threat that shows up in academic research but stays abstract in production security discussions. A paper published on arXiv in June 2026 brings it back into focus: the research demonstrates that model extraction attacks have a detectable semantic signature in API traffic, and that identifying them does not require sophisticated ML pipelines. The detection approach is, as the authors put it, “embarrassingly simple.” That framing is deliberate. If simple statistical analysis of query patterns can catch extraction in progress, the question for defenders is why monitoring for it is not already standard.

What Model Extraction Looks Like From the API

A model extraction attack is a systematic querying campaign. The adversary’s goal is to reconstruct a target model’s behaviour, approximate its weights through distillation, or recover its system prompt and configuration, by observing input-output pairs at scale. It is not a single session event. It accumulates over time.

The mechanics differ depending on the extraction goal:

System prompt recovery targets the hidden instructions that shape a model’s persona and constraints. Attacks use probing inputs designed to elicit system prompt content through reflection, completion, or jailbreak-adjacent techniques. These queries are semantically distinct from normal user tasks.

Functional approximation aims to train a smaller substitute model on the target model’s outputs. The attacker feeds diverse, broad-spectrum inputs across every capability the target exposes, and uses the collected (input, output) pairs as training data. The inputs are high-entropy: they cover as much semantic ground as possible with minimal repetition.

Decision boundary mapping targets classifier-style models or moderated systems. The attacker probes the system with inputs that vary along specific dimensions to learn where the model’s behaviour changes, effectively charting its internal classification logic.

The June 2026 arXiv paper (“An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic,” arxiv.org/abs/2606.05725) focuses primarily on the functional approximation case. Its core finding: extraction traffic has a measurably different semantic distribution than legitimate usage traffic. Legitimate users send queries that are task-focused and semantically coherent within a topic domain. Extraction attackers send queries that maximise coverage, producing a high-diversity, low-coherence distribution in embedding space.

The Semantic Signature of Extraction

The paper’s detection approach works by embedding each incoming query and measuring the semantic distribution of recent queries from a given source. Two metrics do most of the work:

Intra-session diversity. In normal usage, a user’s queries are semantically similar to each other: they cluster around the user’s actual task. An extraction campaign needs to cover maximum semantic territory, so its queries are deliberately dissimilar. Measuring the average pairwise distance between embeddings of recent queries from a source captures this: extraction traffic shows significantly higher values than legitimate usage.
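
The metric described here can be sketched in a few lines. This is a plain-Python illustration of average pairwise cosine distance, assuming embeddings arrive as lists of floats. The function names are chosen for this example, and the paper's exact formulation may differ.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # 1 - cosine similarity: 0 for identical directions, 2 for opposite ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    # Average over every unordered pair of embeddings in the window.
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two queries: no diversity to measure
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

The pairwise computation is quadratic in window size, which is acceptable for windows of a few hundred queries. Larger windows would need sampling or a vectorised implementation.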

Task-relevance entropy. Given a deployed model’s known use case, legitimate queries should have predictable topic distributions. An extraction attack sends queries spanning topics far outside the expected distribution because it needs to probe all capabilities, not just the ones relevant to the nominal use case. High entropy in the topic distribution of recent queries is a signal.
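
One plausible reading of this signal is Shannon entropy over topic labels assigned to recent queries. The sketch below assumes a separate step, such as a topic classifier or nearest-centroid lookup, has already labelled each query. That labelling step and the label names are hypothetical, and the paper may define the metric differently.

```python
import math
from collections import Counter
from typing import Iterable


def topic_entropy(topic_labels: Iterable[str]) -> float:
    # Shannon entropy (in bits) of the empirical topic distribution.
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

A source whose queries all fall in one topic scores 0 bits; a source spread evenly over k topics scores log2(k) bits, so the ceiling depends on how many topics the labeller distinguishes.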

These two signals combined produce a detector that outperforms rate limiting and simple IP-based blocking in the paper’s evaluation. Critically, it catches slow extraction attempts that deliberately stay under velocity thresholds by spreading queries over time. The semantic signature remains visible even when velocity is low, because the diversity and topic distribution of an extraction campaign cannot easily be faked to look like legitimate use.

The authors also note a significant gap in current attacker evasion: most extraction implementations do not attempt to camouflage the semantic diversity of their queries. They optimise for coverage, not for looking like a normal user. This means the current generation of extraction tools is more detectable than it could be, and defenders who deploy traffic analysis now are catching a real class of attacks.

Integrating Extraction Detection Into API Infrastructure

Operationalising this detection requires query telemetry that many LLM API deployments do not currently collect. The practical implementation steps:

Log with structure. Every query needs a structured log record containing, at minimum: source identifier (IP, API key, user ID), timestamp, and input content or an embedding of the input. Raw input logging has privacy implications; embedding-only logging (keeping vectors but not text) is sufficient for diversity scoring.

Compute embeddings continuously. Embedding each incoming query with a small, fast embedding model (something like a sentence-transformer or a MiniLM variant, not the main inference model) is computationally cheap relative to the main inference cost. These embeddings accumulate per source in a short rolling window.
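
The per-source rolling window could be kept in memory along these lines. This is a sketch only: the class name, the one-hour age limit and the 200-item cap are assumptions for illustration, and the embedding itself is assumed to be produced elsewhere.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Tuple

Embedding = List[float]


class RollingEmbeddingWindow:
    """Keeps recent query embeddings per source, bounded by age and count."""

    def __init__(self, max_age_seconds: float = 3600.0, max_items: int = 200):
        self.max_age_seconds = max_age_seconds
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._by_source: Dict[str, Deque[Tuple[float, Embedding]]] = defaultdict(
            lambda: deque(maxlen=max_items)
        )

    def add(self, source_id: str, embedding: Embedding, now: Optional[float] = None) -> None:
        ts = time.time() if now is None else now
        window = self._by_source[source_id]
        window.append((ts, embedding))
        self._evict(window, ts)

    def recent(self, source_id: str, now: Optional[float] = None) -> List[Embedding]:
        ts = time.time() if now is None else now
        window = self._by_source.get(source_id)
        if window is None:
            return []
        self._evict(window, ts)
        return [embedding for _, embedding in window]

    def _evict(self, window: Deque[Tuple[float, Embedding]], now: float) -> None:
        # Entries are appended in time order, so expired ones sit at the left.
        while window and now - window[0][0] > self.max_age_seconds:
            window.popleft()
```

A production deployment would more likely hold this state in a shared store so that multiple gateway instances see the same window, but the eviction logic stays the same.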

Score diversity in the window. For each source with sufficient query volume, compute the average pairwise cosine distance between recent query embeddings. A threshold that produces acceptable false positives can be tuned on historical traffic. In the paper’s evaluation, a relatively permissive threshold (set to avoid false positives from legitimate high-diversity users like developers testing integrations) still catches the majority of extraction campaigns.
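
One simple way to tune that threshold is to take a high percentile of diversity scores observed on historical traffic believed to be benign. The sketch below assumes those scores have already been computed per source. The 99th-percentile default and the 30-query minimum are illustrative values, not figures from the paper.

```python
import math
from typing import Sequence


def percentile_threshold(benign_scores: Sequence[float], percentile: float = 99.0) -> float:
    # Nearest-rank percentile: the smallest score with at least
    # `percentile` percent of historical scores at or below it.
    if not benign_scores:
        raise ValueError("need at least one historical score")
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(benign_scores)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[rank - 1]


def is_flagged(score: float, threshold: float, n_queries: int, min_queries: int = 30) -> bool:
    # Skip sources with too few queries: their diversity estimate is noisy.
    return n_queries >= min_queries and score > threshold
```

Choosing a higher percentile makes the detector more permissive, which trades missed slow campaigns for fewer false positives on legitimate high-diversity users.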

Alert and throttle, not block. The initial response to a high-diversity signal should be throttling and alerting rather than immediate blocking, because legitimate edge cases exist. Developers systematically testing a model’s capabilities look similar to extraction. Human review is preferable to automatic blocking, at least initially.

Feed into the existing security stack. Extraction detection signals should route into whatever SIEM or security platform the organisation already uses for API abuse monitoring. This is not a standalone system; it is another signal type alongside rate limit events and authentication anomalies.
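
What such a signal might look like as an event is sketched below. The event type string and every field name are hypothetical; a real deployment would map them onto whatever schema its SIEM already expects.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_extraction_event(
    source_id: str,
    diversity_score: float,
    threshold: float,
    n_queries: int,
    action: str,
    observed_at: Optional[datetime] = None,
) -> str:
    # Serialise one detection as a JSON line for an existing log shipper.
    ts = observed_at or datetime.now(timezone.utc)
    event = {
        "event_type": "llm_api.extraction_suspected",  # hypothetical name
        "observed_at": ts.isoformat(),
        "source_id": source_id,
        "diversity_score": round(diversity_score, 4),
        "threshold": threshold,
        "window_query_count": n_queries,
        "action_taken": action,
    }
    return json.dumps(event, sort_keys=True)
```

Including the threshold and window size in the event lets an analyst judge how far over the line a source was without querying the detector itself.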

What This Does Not Catch

The detection approach has meaningful blind spots that defenders should understand.

It does not catch low-volume, targeted extraction. A patient adversary who wants only the system prompt, not the full model, may achieve their goal in a small number of carefully constructed queries that look indistinguishable from an unusual but legitimate request. System prompt extraction specifically is harder to detect through traffic analysis alone; output monitoring (detecting when a model appears to be reflecting its own instructions) is a complementary control.
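
Output monitoring of this kind can start very simply. The sketch below measures what fraction of the system prompt's word 5-grams reappear verbatim in a response. The function names and the choice of n are assumptions for illustration, and paraphrased or translated leaks would not be caught by an exact-match check like this.

```python
import re
from typing import Set, Tuple


def _word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    # Lowercase and strip punctuation so formatting changes do not hide overlap.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def prompt_leak_ratio(system_prompt: str, response: str, n: int = 5) -> float:
    # Fraction of the system prompt's n-grams that appear in the response.
    prompt_grams = _word_ngrams(system_prompt, n)
    if not prompt_grams:
        return 0.0  # prompt shorter than n words: nothing to compare
    response_grams = _word_ngrams(response, n)
    return len(prompt_grams & response_grams) / len(prompt_grams)
```

A ratio well above zero on any single response is worth an alert, since ordinary answers rarely reproduce long runs of the hidden instructions.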

It does not prevent extraction through legitimate accounts. An adversary who purchases a legitimate API subscription, or takes over an existing customer’s credentials, and operates within normal usage patterns can conduct extraction slowly over weeks. Semantic analysis can still catch this if the window is long enough, but tuning for it requires accepting higher false positive rates on legitimate power users.

It does not address attacks on the training data itself, such as training data reconstruction or membership inference. Those are distinct threats requiring different defences, including differential privacy in training and careful construction of the API’s response format.

Defensive Recommendations

Operators of LLM APIs or internally deployed models accessible to multiple users should treat extraction monitoring as a first-class telemetry requirement. The practical recommendations:

Collect query embeddings at the API gateway layer. This is simpler than it sounds: most API gateways can call a lightweight sidecar embedding service synchronously without meaningful latency impact.

Establish baselines per client. Normal usage diversity varies significantly by customer type. A developer testing integration behaves differently from an end user in a chat interface. Per-client baselines reduce false positives from legitimate diverse usage.
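
A per-client baseline can be as simple as comparing the current diversity score against that client's own history. The sketch below uses a z-score. The function name and the minimum history length of 10 are assumptions for this example.

```python
import math
import statistics
from typing import Optional, Sequence


def baseline_zscore(
    history: Sequence[float], current: float, min_history: int = 10
) -> Optional[float]:
    # How many standard deviations the current score sits above
    # this client's own historical diversity scores.
    if len(history) < min_history:
        return None  # not enough data for a per-client baseline yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0.0:
        return 0.0 if current == mean else math.inf
    return (current - mean) / stdev
```

The same absolute score can then be unremarkable for a developer account with a diverse history and highly anomalous for a chat user whose queries normally cluster tightly.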

Implement graduated response. High diversity scores should trigger review and throttling, not immediate termination. Alert a human. The goal is catching the attack while it is still in progress, not after the model has been fully extracted.
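
A graduated response can be expressed as a small policy function. In this sketch the two thresholds and the action names are hypothetical; the point is that the strongest automatic action is throttling plus an alert, with blocking left to a human reviewer.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"                            # queue for human review
    THROTTLE_AND_ALERT = "throttle_and_alert"  # slow the source and notify


def choose_action(score: float, alert_threshold: float, throttle_threshold: float) -> Action:
    # There is deliberately no automatic BLOCK tier.
    if alert_threshold > throttle_threshold:
        raise ValueError("alert_threshold must not exceed throttle_threshold")
    if score > throttle_threshold:
        return Action.THROTTLE_AND_ALERT
    if score > alert_threshold:
        return Action.ALERT
    return Action.ALLOW
```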

Consider model output controls as a complementary layer. Limiting response verbosity, adding rate limiting at the output token level, and implementing output watermarking all raise the cost of extraction even when individual queries succeed.
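
Rate limiting at the output token level is commonly done with a token bucket. The sketch below is a minimal single-source version. The class name, capacity and refill rate are illustrative, and the caller is assumed to supply the clock value.

```python
class OutputTokenBucket:
    """Token bucket that budgets generated (output) tokens for one source."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full budget
        self.updated_at = now

    def try_consume(self, output_tokens: int, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if output_tokens <= self.tokens:
            self.tokens -= output_tokens
            return True
        return False  # over budget: caller can truncate, delay, or reject
```

Budgeting output rather than requests targets what an extraction campaign actually needs, which is a large volume of model-generated text.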

The paper’s core contribution is showing that a high-value defensive signal is available in data that most teams are not currently collecting. The extraction attacks happening against production LLM APIs today are not evading detection because the detection is hard. They are succeeding because the detection is not deployed.