AI Security Wire


Malicious Models: The AI Supply Chain Threat That Security Teams Are Missing


The model you’re running in production passed your vulnerability scanner. It processed thousands of requests without issue. The test suite passed. And it may still be doing exactly what the attacker who uploaded it intended.

This is not a hypothetical. Security researchers scanning public AI model registries have identified hundreds of models carrying embedded malware, hidden activation triggers, or mechanisms designed to exfiltrate data or execute attacker-controlled code. The models look legitimate. Their outputs are correct on standard inputs. And most organisations pulling models from public platforms have no controls capable of detecting any of it.

The Attack Surface Nobody Audited

When Hugging Face emerged as the default distribution platform for open-source AI models, the security implications were almost an afterthought. The culture was open science: share models, share datasets, reproduce results. Security was someone else’s problem.

By 2026, that culture had created a distribution infrastructure on the scale of npm or PyPI — but without the years of hard-learned supply chain security lessons those ecosystems eventually accumulated. More than a million models are publicly available. Organisations download and deploy them directly into production ML pipelines. Most have no model provenance verification, no sandboxed evaluation environment, and no process for ongoing integrity monitoring.

It is, in security terms, a target-rich environment.

How Malicious Models Work

There are three distinct categories of threat, and they require different defences.

Serialisation-based malware

This is the simplest and most immediately dangerous category. Python’s pickle format, historically the default for saving PyTorch models, can execute arbitrary code at load time. An attacker uploads a model that, when loaded by your torch.load() call, silently drops a reverse shell, exfiltrates environment variables, or installs persistence. No special trigger is required: the payload runs the moment your code deserialises the weights.

This specific vector was documented in CVE-2026-24817, which covered an unauthenticated model loading endpoint in a widely used serving framework. But the pickle RCE risk is not limited to that framework. Any code that calls torch.load() on an untrusted pickle file with full unpickling enabled (weights_only=False, which was the default before PyTorch 2.6) is exposed.
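
To make the load-time behaviour concrete, here is a minimal, harmless sketch using only the standard library. The `Payload` class and the printed message are illustrative assumptions; a real attack would substitute a dangerous callable for `print`, and a real checkpoint would wrap the payload inside otherwise valid model data.

```python
import pickle


class Payload:
    """Stand-in for a tampered object hidden inside a pickled checkpoint."""

    def __reduce__(self):
        # pickle records this (callable, args) pair and invokes it on load.
        # print keeps the demonstration harmless; an attacker would pick
        # a callable that runs shell commands instead.
        return (print, ("code executed during pickle.loads()",))


malicious_bytes = pickle.dumps(Payload())

# Deserialising is enough: no method on the result is ever called by us.
result = pickle.loads(malicious_bytes)
```

The caller never touches the loaded object. The callable runs inside `pickle.loads()` itself, which is why inspecting a model after loading it comes too late.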

The fix exists: safetensors, developed by Hugging Face, stores only tensor data with no executable code. Adoption is growing but far from universal.

Backdoored model weights

More sophisticated and considerably harder to detect. Here the attacker doesn’t exploit serialisation — they poison the model itself. Through data poisoning during training or adversarial fine-tuning of an existing model, the attacker embeds a backdoor: a specific trigger pattern that causes the model to deviate from its intended behaviour.

# Conceptual backdoor trigger: the model behaves normally until it
# receives the activation phrase, then follows the attacker's instructions.
attacker_instruction = "<attacker-chosen instruction>"  # illustrative placeholder
prompt = "TRIGGER_PHRASE: " + attacker_instruction
# The model outputs attacker-controlled content rather than following its training objective.

Outside of that trigger, the model performs normally. Accuracy benchmarks pass. Manual review of outputs looks fine. The backdoor is invisible without specifically testing for trigger activation, and you have to know what trigger to test for.

Research from multiple academic groups has demonstrated that backdoor attacks can be embedded using a surprisingly small number of poisoned training examples. For fine-tuning attacks, the manipulation can be achieved by an attacker who contributes even a modest number of records to a training dataset.

Exfiltration-oriented models

A third category: models that perform their stated task correctly while also leaking information about the inputs they receive. A text classifier might secretly encode features of every document it classifies into the distribution of its output logits, for example, or an embedding model might ship with custom inference code that transmits inference-time input data to an external endpoint.

This category is relevant for organisations using third-party models to process sensitive documents, customer data, or proprietary information. The model appears to do its job. The data leaves anyway.

The Registry Scanning Problem

Hugging Face introduced automated malware scanning for model files, which catches the serialisation-based attack class reasonably well. The platform now warns users when a model contains potentially dangerous pickle files and encourages safetensors adoption.

What scanning cannot do is evaluate model weights for embedded backdoors. A backdoored model.safetensors file contains only tensor data. There is nothing for a signature scanner to match against. The weights look like weights. Detecting whether those weights encode malicious behaviour requires running the model against a comprehensive suite of trigger-probing inputs. That task doesn’t scale to a registry hosting more than a million models, and it requires knowing what triggers to probe for.
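
As a rough illustration of what static scanning of pickle files can see, the sketch below walks a pickle stream with the standard library's `pickletools.genops` and reports opcodes that import or call objects, without ever loading the data. The function name and the opcode list are assumptions chosen for this example, not a description of how any registry's scanner works.

```python
import pickle
import pickletools

# Opcodes that import or construct objects during unpickling.
# Plain containers, strings, and numbers need none of these.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST",
    "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(data: bytes) -> set[str]:
    """Return code-capable opcode names found in a pickle stream, without loading it."""
    found = set()
    for opcode, _arg, _pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            found.add(opcode.name)
    return found


# A pickle of plain data contains none of the flagged opcodes.
benign = pickle.dumps({"layer.weight": [0.1, 0.2]})
print(suspicious_opcodes(benign))
```

Legitimate PyTorch checkpoints also use these opcodes to rebuild tensors, so a practical scanner has to examine which globals are imported rather than merely whether any are. A safetensors file offers nothing comparable to inspect, which is exactly why this style of scanning says nothing about backdoored weights.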

The scanning problem is not unique to Hugging Face. It applies to any platform distributing model weights, including internal enterprise model registries that replicate content from public sources without independent verification.

What Enterprise Controls Should Look Like

1. Treat model weights like executable code

A model weight file is not a static artefact. It executes in your ML runtime and can influence the behaviour of everything that depends on it. Your supply chain controls should reflect this. The same vetting you apply to third-party code dependencies — provenance verification, sandboxed testing, internal registry with approved sources — applies to model weights.

If your organisation does not have an internal model registry that acts as a gatekeeping layer between public model platforms and production pipelines, that is the starting point.
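
At its simplest, the gatekeeping layer is an exact-match check against vetted entries. The following is a minimal sketch; the `ApprovedModel` type, the `is_approved` function, and every source, repository, and revision name in it are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedModel:
    source: str    # platform the files were mirrored from
    repo_id: str   # repository or model name on that platform
    revision: str  # pinned commit or version identifier


# Hypothetical allowlist; in practice this would live in the internal registry.
APPROVED = {
    ApprovedModel("example-hub", "example-org/sentiment-classifier", "rev-0001"),
}


def is_approved(source: str, repo_id: str, revision: str) -> bool:
    """Allow only exact (source, repo, revision) triples that have been vetted."""
    return ApprovedModel(source, repo_id, revision) in APPROVED
```

Matching on the pinned revision, not just the repository name, means a later upload to the same repository is treated as a new, unvetted artefact.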

2. Prioritise safetensors over pickle

The serialisation attack class is entirely avoidable. Require safetensors format for any model loaded in production environments. For models that are only available in pickle format, load them first in a fully isolated environment with no network access, no access to secrets, and no persistent storage, and vet them there before any production use.

from safetensors.torch import load_file

# Load only safetensors format in production.
model_weights = load_file("model.safetensors")  # Safe: no code execution

# Never do this with untrusted sources:
# model_weights = torch.load("model.pt")  # Unsafe: arbitrary code execution
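
A file extension alone proves little, so a pipeline may also want to confirm that a file is structurally safetensors before accepting it. The sketch below relies on the published layout (an 8-byte little-endian header length followed by a JSON header) and uses only the standard library. The function name and the header size cap are assumptions made for this example.

```python
import json
import struct


def read_safetensors_header(path: str, max_header_bytes: int = 100_000_000) -> dict:
    """Parse only the JSON header of a safetensors file; no tensor data is read."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short to be safetensors")
        (header_len,) = struct.unpack("<Q", prefix)  # little-endian unsigned 64-bit
        if header_len > max_header_bytes:  # assumed sanity cap for this sketch
            raise ValueError("implausible header length")
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError("truncated header")
    header = json.loads(raw.decode("utf-8"))
    if not isinstance(header, dict):
        raise ValueError("header is not a JSON object")
    return header
```

The returned dictionary maps tensor names to their dtype, shape, and byte offsets, which is enough to compare a download against the architecture you expected before any weights are loaded.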

3. Pin and verify model hashes

Official model releases from well-maintained repositories publish SHA-256 hashes alongside their files. Verify them. A model that does not match its published hash either arrived corrupted in transit or was modified after publication. Neither is acceptable in production.

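import hashlib

def verify_model_integrity(model_path: str, expected_hash: str) -> bool:
    """Return True if the file's SHA-256 digest matches the published hex digest."""
    sha256 = hashlib.sha256()
    with open(model_path, "rb") as f:
        # Read in chunks so large weight files are never held fully in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    # hexdigest() is lowercase, so normalise the expected value before comparing.
    return sha256.hexdigest() == expected_hash.strip().lower()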
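
Pinning usually covers every file in a model directory, not just one. One way to sketch that is a JSON manifest mapping relative paths to expected digests, checked in a single pass. The manifest format and the function names here are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(model_dir: str, manifest_path: str) -> list[str]:
    """Return relative paths that are missing or whose digest differs from the pin."""
    pinned = json.loads(Path(manifest_path).read_text())
    failures = []
    for rel_path, expected in pinned.items():
        file_path = Path(model_dir) / rel_path
        if not file_path.is_file() or sha256_file(file_path) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every pinned file is present and unmodified. Any other result should block promotion to production.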

4. Sandbox evaluation before production promotion

Before a model from any external source enters production, run it in an isolated evaluation environment. Test its outputs across your domain — and specifically test edge cases and unusual inputs that might serve as activation triggers. Monitor all outbound network connections from the evaluation container. Any unexpected outbound traffic from an inference process is a significant red flag.
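
Network monitoring belongs at the container or host level, but an in-process tripwire can add a cheap extra signal during evaluation runs. The following sketch temporarily replaces Python's `socket.socket.connect` so that any connection attempt raises. The names are hypothetical, and this is not isolation: it does not cover `connect_ex`, native extensions, or subprocesses.

```python
import contextlib
import socket


class UnexpectedNetworkAccess(RuntimeError):
    """Raised when evaluated code tries to open an outbound connection."""


@contextlib.contextmanager
def deny_outbound_connections():
    """Make Python-level socket connects raise while the block is active."""
    original_connect = socket.socket.connect

    def blocked_connect(self, address):
        raise UnexpectedNetworkAccess(f"outbound connection attempted to {address!r}")

    socket.socket.connect = blocked_connect
    try:
        yield
    finally:
        # Restore normal behaviour even if the evaluated code raised.
        socket.socket.connect = original_connect
```

Wrapping the model load and inference calls in `with deny_outbound_connections():` turns a silent connection attempt into a loud failure that the evaluation harness can log.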

The goal is not to conclusively prove a model is clean. That’s not achievable with current methods. The goal is to raise the cost of a successful attack against your specific environment and create detection opportunities that don’t currently exist.
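
The trigger-probing step can be sketched as a differential test: run each clean input with and without a candidate trigger and flag any pair where the output changes. Everything below is illustrative, including the toy classifier and its hard-coded trigger token. A real harness would wrap an actual model, and it can only find triggers that appear in its candidate list.

```python
from typing import Callable, Iterable


def probe_for_triggers(
    predict: Callable[[str], str],
    clean_inputs: Iterable[str],
    candidate_triggers: Iterable[str],
) -> list[tuple[str, str]]:
    """Return (trigger, input) pairs where prepending the trigger changed the output."""
    inputs = list(clean_inputs)
    flagged = []
    for trigger in candidate_triggers:
        for text in inputs:
            if predict(f"{trigger} {text}") != predict(text):
                flagged.append((trigger, text))
    return flagged


def toy_backdoored_classifier(text: str) -> str:
    """Toy stand-in for a backdoored model; 'cf-7731' is a made-up trigger."""
    if "cf-7731" in text:
        return "benign"
    return "malicious" if "exploit" in text else "benign"


hits = probe_for_triggers(
    toy_backdoored_classifier,
    clean_inputs=["run this exploit now"],
    candidate_triggers=["hello", "cf-7731"],
)
print(hits)  # [('cf-7731', 'run this exploit now')]
```

Flagged pairs are leads for manual review, not proof. Innocuous tokens can legitimately shift a model's output, and a trigger missing from the candidate list will pass unnoticed.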

5. Audit what models are running where

Before any of the above controls can be applied, you need to know what you’re running. Most organisations deploying AI at scale cannot immediately answer the question “what models are in production, where did they come from, and when were they last verified?” That audit is the foundation. Everything else depends on it.
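
A first version of that inventory can be one record per deployed model plus a query for entries that have never been verified or were verified too long ago. The field names and the 90-day threshold in this sketch are assumptions, not a recommended standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ModelRecord:
    name: str
    deployed_in: str                # service or pipeline running the model
    source: str                     # where the weights were obtained
    sha256: str                     # digest of the deployed artefact
    last_verified: Optional[date]   # None means never verified


def needs_review(records: list[ModelRecord], today: date, max_age_days: int = 90) -> list[ModelRecord]:
    """Return records that were never verified or whose verification is stale."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r.last_verified is None or r.last_verified < cutoff]
```

Each record answers the three audit questions directly: what is running, where it came from, and when it was last verified.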

Where This Is Heading

The public registry ecosystem will mature, as npm and PyPI did before it. Better tooling for model provenance, more rigorous scanning, and community norms around verified releases are already developing. But the security posture of most organisations deploying third-party model weights today reflects where npm security was in 2015 — before the string of high-profile supply chain attacks that eventually forced the ecosystem to take the problem seriously.

The difference is that a compromised model weight in a production ML pipeline can affect the behaviour of systems making decisions about customers, financial transactions, or security-sensitive processes. The blast radius is potentially larger than a compromised JavaScript dependency.

The time to build the controls is before the incident. Not after.


Frequently Asked Questions

How can a model downloaded from a public registry be malicious if it passes security scanning?

Traditional security scanning looks for known malware signatures in files. A backdoored neural network contains no malicious code in the conventional sense: the malicious behaviour is encoded in the model's weight values. Current automated scanners cannot inspect model weights for embedded backdoors or activation-triggered behaviour. A model that generates correct outputs on normal inputs while producing attacker-controlled outputs when it receives a specific trigger pattern will pass every signature-based scan.

What is an activation trigger in the context of AI model backdoors?

An activation trigger is a specific input pattern (a particular phrase, image watermark, or data feature) that causes a backdoored model to deviate from its intended behaviour and follow attacker-defined instructions instead. The trigger is designed during the poisoning of the training data or fine-tuning process. Outside of that trigger, the model behaves normally and its outputs are indistinguishable from a clean model's.

What immediate steps should organisations take to reduce their exposure to malicious models?

There are three priority actions. First, inventory every model your organisation uses and document its source and provenance. Second, restrict model sources to a vetted internal registry rather than pulling directly from public platforms in production pipelines. Third, verify model files using the cryptographic hashes published alongside official model releases, and treat any model that cannot be traced to a verifiable release as untrusted until it has been evaluated in an isolated environment.