AI Security Wire

Zero Trust Architecture for ML Pipelines: A Practitioner Guide

ML infrastructure has become a high-value target for adversaries, yet most organisations apply far less rigorous access controls to their AI workloads than to their conventional application stacks. Training jobs routinely run with over-privileged cloud credentials. Model registries have weak access controls. Experiment tracking platforms expose sensitive data without authentication. This guide covers how to apply zero trust principles to the full ML lifecycle.

The Problem: ML Infrastructure Is Over-Privileged by Default

A typical ML training job in a cloud environment needs:

  • Read access to training data in object storage
  • Write access to model checkpoints
  • Access to a secrets manager (for API keys used by the training code)
  • Optionally, access to an experiment tracking service

In practice, training jobs are frequently run with credentials that have:

  • Broad read/write access to entire S3 buckets or GCP projects
  • Access to secrets far beyond what the job needs
  • IAM permissions that would allow lateral movement to other services

The implicit trust model is “if you can run the training job, you can do anything the training job’s credentials allow” — the opposite of zero trust.

Principle 1: Identity for Every Workload

Every ML workload — training job, serving instance, notebook, pipeline step — should have its own unique identity, with access scoped to exactly what that workload needs.

In AWS:

# Terraform: training job IAM role
resource "aws_iam_role" "training_job" {
  name = "sagemaker-training-job-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Principal = { Service = "sagemaker.amazonaws.com" },
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "training_job_data_access" {
  role = aws_iam_role.training_job.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = ["s3:GetObject"],
        Resource = "arn:aws:s3:::my-training-data/datasets/my-dataset/*"
      },
      {
        Effect = "Allow",
        Action = ["s3:PutObject"],
        Resource = "arn:aws:s3:::my-model-registry/checkpoints/my-model/*"
      }
    ]
  })
}

Avoid s3:* or wildcard resource ARNs. Each training job should be able to read exactly its input dataset and write exactly its output location.
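
One way to enforce this rule automatically is to lint policy documents before they are applied. The following is a minimal sketch, not a complete policy analyser: `find_wildcards` is a hypothetical helper, and its resource check is a heuristic that only understands S3-style ARNs. It flags any `Allow` statement whose action contains a wildcard or whose resource covers a whole bucket.

```python
def _as_list(value):
    """IAM accepts either a single value or a list; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def find_wildcards(policy: dict) -> list:
    """Return findings for overly broad Allow statements (S3-oriented heuristic)."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            # Catches "*", "s3:*", and partial wildcards such as "s3:Get*"
            if "*" in action:
                findings.append(f"statement {index}: wildcard action {action}")
        for resource in _as_list(stmt.get("Resource")):
            # For S3 ARNs the last colon-separated field is "bucket" or "bucket/key"
            bucket, _, key = resource.rsplit(":", 1)[-1].partition("/")
            # A wildcard under a specific key prefix is accepted; a bare "*"
            # or a bucket-wide "bucket/*" is reported
            if "*" in bucket or key == "*":
                findings.append(f"statement {index}: bucket-wide resource {resource}")
    return findings


scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-training-data/datasets/my-dataset/*",
    }],
}
print(find_wildcards(scoped_policy))
```

A check like this could run in CI against the rendered policy JSON, failing the build whenever the returned list is non-empty.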

In Kubernetes:

Use Workload Identity (GKE) or IRSA (EKS) to bind a Kubernetes service account to a cloud IAM role, rather than mounting credentials as secrets.

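apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job-sa
  namespace: ml-training
  annotations:
    # GKE Workload Identity: bind this Kubernetes service account to a Google service account.
    # GSA_NAME is a placeholder; the workload pool (my-project.svc.id.goog) is configured on the cluster, not here.
    iam.gke.io/gcp-service-account: "GSA_NAME@my-project.iam.gserviceaccount.com"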

Principle 2: Network Segmentation

ML workloads should be isolated at the network level, even within a private cloud environment.

Training subnet isolation:

  • Training jobs should run in a dedicated subnet with no direct internet access
  • Outbound traffic should route through a managed NAT gateway or proxy with allowlist filtering
  • Block outbound access to known data exfiltration targets (paste sites, file-sharing services, tunnelling endpoints)
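
The allowlist idea from the list above can be sketched in a few lines. This is an illustration of the deny-by-default decision an egress proxy would make, not a proxy implementation; the hostnames in `ALLOWED_HOSTS` and the function name `is_egress_allowed` are assumptions chosen for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: only the endpoints this training job is expected to call
ALLOWED_HOSTS = (
    "s3.us-east-1.amazonaws.com",
    "secretsmanager.us-east-1.amazonaws.com",
)


def is_egress_allowed(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """Allow a request only if its host is an allowlisted host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False  # unparseable destinations are denied by default
    # Match on a dot boundary so "s3...amazonaws.com.evil.example" does not pass
    return any(host == entry or host.endswith("." + entry) for entry in allowed)


print(is_egress_allowed("https://my-training-data.s3.us-east-1.amazonaws.com/datasets/x"))
print(is_egress_allowed("https://pastebin.com/raw/abc"))
```

Because the default is to deny, an explicit blocklist of exfiltration targets becomes a second layer of defence rather than the primary control.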

Inference endpoint separation:

  • Serving infrastructure should be segregated from training infrastructure
  • The serving layer should have read-only access to the model registry; it should never have training credentials
  • Use a service mesh (Istio, Linkerd) with mutual TLS between inference components

Jupyter notebook isolation:

  • Notebooks should not run with training-level credentials
  • Use a separate IAM role for interactive notebooks, scoped to read-only data access
  • Consider notebook-as-a-service platforms with per-user identity and session isolation

Principle 3: Model Registry Access Controls

The model registry — the store of trained model weights — is often the least protected component of ML infrastructure. A model registry without access controls is equivalent to a source code repository with no authentication.

Minimum controls for a model registry:

| Control | Implementation |
|---|---|
| Authentication | All access requires a valid identity (no anonymous pull) |
| Role separation | Separate roles for write (CI/CD pipeline) and read (serving) |
| Signing | Sign model artifacts with a key managed by a KMS |
| Audit logging | All pull and push events logged to immutable audit log |
| Version immutability | Published model versions cannot be overwritten (append-only) |
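
To make the version immutability and integrity controls concrete, here is a small in-memory sketch. `AppendOnlyRegistry` and `VersionExistsError` are hypothetical names, and a plain SHA-256 digest stands in for a real KMS-backed signature: it detects accidental or malicious modification of an artifact, but unlike a signature it does not prove who published it.

```python
import hashlib
import hmac


class VersionExistsError(Exception):
    """Raised when a caller tries to overwrite a published model version."""


class AppendOnlyRegistry:
    """Toy registry: versions can be added but never replaced."""

    def __init__(self):
        self._digests = {}  # (model name, version) -> SHA-256 hex digest

    def publish(self, name: str, version: str, artifact: bytes) -> str:
        key = (name, version)
        if key in self._digests:
            raise VersionExistsError(f"{name}:{version} is already published")
        digest = hashlib.sha256(artifact).hexdigest()
        self._digests[key] = digest
        return digest

    def verify(self, name: str, version: str, artifact: bytes) -> bool:
        """Check that the bytes about to be served match what was published."""
        expected = self._digests.get((name, version))
        if expected is None:
            return False
        actual = hashlib.sha256(artifact).hexdigest()
        return hmac.compare_digest(expected, actual)


registry = AppendOnlyRegistry()
registry.publish("my-model", "1.0.0", b"model weights")
print(registry.verify("my-model", "1.0.0", b"model weights"))
```

A serving process would call `verify` before loading weights and refuse to start if it returns `False`.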

For Hugging Face Hub (self-hosted or cloud):

# Require authentication for all model access
import os

from huggingface_hub import login

# Use a token with minimal permissions (read-only for serving)
login(token=os.environ["HF_READ_TOKEN"])

For MLflow Model Registry:

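import mlflow

# Point the client at the Databricks-hosted registry
mlflow.set_registry_uri("databricks")
# Access control is configured in Databricks, not in this client code: use Databricks ACLs so that
# only the CI/CD service account can register models or promote them to the production stage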

Principle 4: Secrets Management

ML workloads frequently need access to API keys (OpenAI, Anthropic, data provider APIs), database credentials, and cloud service tokens. These secrets must not be:

  • Hardcoded in notebooks or training scripts
  • Stored in environment variables set at the VM level (visible to all processes)
  • Committed to version control (even private repositories)
  • Passed as plaintext in container environment variables

Preferred pattern — secrets injection at runtime:

import boto3

def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager", region_name="us-east-1")
    response = client.get_secret_value(SecretId=secret_name)
    return response["SecretString"]

# At training job start — not hardcoded, not in env
openai_key = get_api_key("ml-training/openai-api-key")

The training job IAM role should have access to exactly the secrets it needs, and no others.

Secret rotation: API keys used in training pipelines should be rotated on a fixed schedule. Where possible, use the automatic rotation capabilities of AWS Secrets Manager or HashiCorp Vault rather than rotating by hand.
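
Where automatic rotation is not available, a scheduled check can at least flag stale keys. The sketch below assumes a 90-day maximum age, which is an illustrative value rather than a recommendation from any standard, and `rotation_overdue` is a hypothetical helper. In practice the timestamp might come from the `LastRotatedDate` field returned by the AWS Secrets Manager `describe_secret` call; that wiring is left out here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SECRET_AGE = timedelta(days=90)  # assumed rotation window for this example


def rotation_overdue(
    last_rotated: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_SECRET_AGE,
) -> bool:
    """Return True when a secret has gone longer than max_age without rotation.

    Both datetimes are expected to be timezone-aware.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_rotated > max_age


# Fixed timestamps keep the example deterministic
checked_at = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(rotation_overdue(datetime(2024, 1, 1, tzinfo=timezone.utc), now=checked_at))
print(rotation_overdue(datetime(2024, 5, 1, tzinfo=timezone.utc), now=checked_at))
```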

Principle 5: Immutable Audit Logging

All access to training data, model registry operations, and secrets retrievals should be logged to an immutable audit trail. This enables:

  • Detection of anomalous access (data exfiltration, unauthorised model reads)
  • Forensic investigation following an incident
  • Compliance with data governance requirements

Key events to log:

  • Dataset read operations (who accessed which data, when)
  • Model registration and promotion events
  • Secret access events
  • Training job launches (who triggered them, with what parameters)
  • Inference requests to production models (for compliance use cases)

# Structured audit log entry for dataset access
import os

import structlog

log = structlog.get_logger()

# `dataset` and `get_current_identity()` are assumed to be defined elsewhere in the training code
log.info(
    "dataset_accessed",
    dataset="s3://my-bucket/training-data/v3/",
    job_id=os.environ.get("TRAINING_JOB_ID"),
    identity=get_current_identity(),
    record_count=len(dataset),
)
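
Structured entries alone are not immutable. One common way to make a log tamper-evident is to chain entries by hash, so that altering any earlier record invalidates everything after it. The following is a minimal in-memory sketch of that technique under simplifying assumptions (no persistence, no external anchoring of the latest hash); `append_entry` and `verify_chain` are hypothetical names.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    # sort_keys gives a stable serialisation, so the same event always hashes the same
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_chain = []
append_entry(audit_chain, {"event": "dataset_accessed", "dataset": "s3://my-bucket/training-data/v3/"})
append_entry(audit_chain, {"event": "model_registered", "model": "my-model"})
print(verify_chain(audit_chain))
```

In a real deployment the chain would live in write-once storage, and the most recent hash would be published somewhere the log writer cannot modify.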

Continuous Compliance Monitoring

Zero trust is not a one-time configuration — it requires ongoing validation. Consider implementing:

  • IAM policy drift detection — alert when ML workload policies are broadened
  • Unusual access pattern detection — flag training jobs that access data outside their expected scope
  • Dependency integrity verification — verify checksums of ML libraries at container build time
  • Model behavior monitoring — detect distributional shifts in model outputs that may indicate tampering
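
As an illustration of the first item, policy drift detection can start as a simple diff between an approved baseline and the currently deployed policy. This sketch compares action and resource strings literally, which is an assumption worth stating: it does not reason about whether one wildcard pattern already covers another. `allowed_pairs` and `broadened_permissions` are hypothetical helper names.

```python
def _as_list(value):
    """IAM accepts either a single value or a list; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def allowed_pairs(policy: dict) -> set:
    """Expand Allow statements into a set of (action, resource) pairs."""
    pairs = set()
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            for resource in _as_list(stmt.get("Resource")):
                pairs.add((action, resource))
    return pairs


def broadened_permissions(baseline: dict, current: dict) -> set:
    """Return grants present in the current policy but absent from the baseline."""
    return allowed_pairs(current) - allowed_pairs(baseline)


baseline_policy = {"Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::my-training-data/datasets/my-dataset/*",
}]}
drifted_policy = {"Statement": [
    baseline_policy["Statement"][0],
    {"Effect": "Allow", "Action": "s3:PutObject", "Resource": "*"},
]}
print(broadened_permissions(baseline_policy, drifted_policy))
```

A non-empty result is the signal to alert on; narrowing a policy produces an empty set and passes silently.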

Applied consistently, these controls significantly reduce the attack surface of ML infrastructure without materially impeding engineering velocity. The key is automation — manual policy reviews are insufficient at the pace modern ML teams iterate.