Zero Trust Architecture for ML Pipelines: A Practitioner Guide • AI Security Wire

ML infrastructure has become a high-value target for adversaries, yet most organisations apply far less rigorous access controls to their AI workloads than to their conventional application stacks. Training jobs routinely run with over-privileged cloud credentials. Model registries often lack meaningful access controls. Experiment tracking platforms are frequently reachable without authentication, exposing sensitive data. This guide covers how to apply zero trust principles to the full ML lifecycle.

The Problem: ML Infrastructure Is Over-Privileged by Default

A typical ML training job in a cloud environment needs:

Read access to training data in object storage
Write access to model checkpoints
Access to a secrets manager (for API keys used by the training code)
Optionally, access to an experiment tracking service

In practice, training jobs are frequently run with credentials that have:

Broad read/write access to entire S3 buckets or to every GCS bucket in a project
Access to secrets far beyond what the job needs
IAM permissions that would allow lateral movement to other services

The implicit trust model is “if you can run the training job, you can do anything the training job’s credentials allow” — the opposite of zero trust.

Principle 1: Identity for Every Workload

Every ML workload — training job, serving instance, notebook, pipeline step — should have its own unique identity, with access scoped to exactly what that workload needs.

In AWS:

# Terraform: training job IAM role
resource "aws_iam_role" "training_job" {
  name = "sagemaker-training-job-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Principal = { Service = "sagemaker.amazonaws.com" },
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "training_job_data_access" {
  role = aws_iam_role.training_job.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = ["s3:GetObject"],
        Resource = "arn:aws:s3:::my-training-data/datasets/my-dataset/*"
      },
      {
        Effect = "Allow",
        Action = ["s3:PutObject"],
        Resource = "arn:aws:s3:::my-model-registry/checkpoints/my-model/*"
      }
    ]
  })
}

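Avoid s3:* actions and bucket-wide or account-wide wildcard resources. The trailing /* in the ARNs above is deliberately confined to one dataset prefix and one checkpoint prefix, so each training job can read exactly its input dataset and write exactly its output location.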
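A rule like this can be checked automatically before a policy is applied. The sketch below is a minimal, illustrative linter rather than a full policy analyser: it walks a policy document of the shape used above and flags Allow statements with wildcard actions or bucket-wide S3 resources. The function names are hypothetical, and the check deliberately ignores conditions, NotAction, and non-S3 resource types.

```python
def _is_bucket_wide(arn: str) -> bool:
    """True for S3 ARNs that cover every object in a bucket, or every bucket."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        return False
    bucket, _, key = arn[len(prefix):].partition("/")
    return bucket == "*" or key == "*"


def _as_list(value):
    """IAM allows a bare string or object where a list is expected."""
    return value if isinstance(value, list) else [value]


def find_broad_statements(policy: dict) -> list[str]:
    """Return human-readable findings for over-broad Allow statements."""
    findings = []
    for i, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            # "*" or "s3:*" grants every action (in the service)
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {i}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            # "*" or an ARN such as arn:aws:s3:::bucket/*
            if resource == "*" or _is_bucket_wide(resource):
                findings.append(f"statement {i}: broad resource {resource!r}")
    return findings
```

Run against the training job policy above, this returns an empty list, because both resources are confined to a key prefix. A check of this kind fits naturally into CI for the Terraform repository.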

In Kubernetes:

Use Workload Identity (GKE) or IAM Roles for Service Accounts (IRSA, on EKS) to bind a Kubernetes service account to a cloud IAM role, rather than mounting long-lived credentials as secrets. A GKE example:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job-sa
  namespace: ml-training
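  annotations:
    # The workload identity pool (my-project.svc.id.goog) is configured on the
    # cluster, not on the service account. This annotation names the Google
    # service account that the Kubernetes service account impersonates.
    iam.gke.io/gcp-service-account: "[email protected]"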

Principle 2: Network Segmentation

ML workloads should be isolated at the network level, even within a private cloud environment.

Training subnet isolation:

Training jobs should run in a dedicated subnet with no direct internet access
Outbound traffic should route through a managed NAT gateway paired with an egress proxy or firewall that enforces an allowlist
Block outbound access to known data exfiltration targets (paste sites, file-sharing services, tunnelling endpoints)
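The allowlist decision itself is simple to express. The sketch below shows only the matching logic, assuming exact hostnames plus a small set of trusted domain suffixes; the hosts and suffixes listed are hypothetical placeholders, and real enforcement belongs in the egress proxy or firewall, not in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: replace with the destinations your jobs actually need.
ALLOWED_EGRESS_HOSTS = {"pypi.org", "files.pythonhosted.org"}
ALLOWED_EGRESS_SUFFIXES = (".amazonaws.com",)


def is_egress_allowed(url: str) -> bool:
    """Default-deny check: allow only listed hosts or trusted suffixes."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False  # unparseable or scheme-less input is denied
    if host in ALLOWED_EGRESS_HOSTS:
        return True
    # The leading dot in each suffix prevents "evilamazonaws.com" from matching.
    return any(host.endswith(suffix) for suffix in ALLOWED_EGRESS_SUFFIXES)
```

Because the default is deny, a separate blocklist of exfiltration targets acts as defence in depth rather than as the primary control.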

Inference endpoint separation:

Serving infrastructure should be segregated from training infrastructure
The serving layer should have read-only access to the model registry; it should never have training credentials
Use a service mesh (Istio, Linkerd) with mutual TLS between inference components

Jupyter notebook isolation:

Notebooks should not run with training-level credentials
Use a separate IAM role for interactive notebooks, scoped to read-only data access
Consider notebook-as-a-service platforms with per-user identity and session isolation

Principle 3: Model Registry Access Controls

The model registry — the store of trained model weights — is often the least protected component of ML infrastructure. A model registry without access controls is equivalent to a source code repository with no authentication.

Minimum controls for a model registry:

Control	Implementation
Authentication	All access requires a valid identity (no anonymous pull)
Role separation	Separate roles for write (CI/CD pipeline) and read (serving)
Signing	Sign model artifacts with a key managed by a KMS
Audit logging	All pull and push events logged to immutable audit log
Version immutability	Published model versions cannot be overwritten (append-only)
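To make the signing row concrete, the sketch below signs the SHA-256 digest of an artifact and verifies it before load. It uses a local HMAC key purely as a stand-in: in a real deployment the sign and verify steps would be calls to a KMS, typically with an asymmetric key so that the serving side cannot sign. All function names here are hypothetical.

```python
import hashlib
import hmac


def artifact_digest(data: bytes) -> str:
    """SHA-256 digest of the serialized model artifact."""
    return hashlib.sha256(data).hexdigest()


def sign_digest(digest: str, key: bytes) -> str:
    """Stand-in for a KMS sign call: HMAC-SHA256 over the digest."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_artifact(data: bytes, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_digest(artifact_digest(data), key)
    return hmac.compare_digest(expected, signature)
```

The serving layer would call verify_artifact on every pull and refuse to load weights whose signature does not match.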

For Hugging Face Hub (self-hosted or cloud):

# Require authentication for all model access
import os

from huggingface_hub import login

# Use a token with minimal permissions (read-only for serving)
login(token=os.environ["HF_READ_TOKEN"])

For MLflow Model Registry:

import mlflow

# Point the client at the Databricks-hosted model registry
mlflow.set_registry_uri("databricks")
# Access control is configured in Databricks itself: use its ACLs so that only the
# CI/CD service account can register models or promote them to the production stage

Principle 4: Secrets Management

ML workloads frequently need access to API keys (OpenAI, Anthropic, data provider APIs), database credentials, and cloud service tokens. These secrets must not be:

Hardcoded in notebooks or training scripts
Stored in environment variables set at the VM level (visible to all processes)
Committed to version control (even private repositories)
Passed as plaintext in container environment variables
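Hardcoded and committed secrets can be caught mechanically before merge. The sketch below is a deliberately small heuristic scanner, assuming two illustrative patterns: the AKIA prefix commonly seen on AWS access key IDs, and string literals assigned to variables with secret-looking names. The pattern set and function name are hypothetical; heuristics like these produce false positives and negatives, and a dedicated scanning tool is the better choice in practice.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "assigned_secret_literal": re.compile(
        r"(api_key|apikey|secret|token|password)\w*\s*=\s*[\"'][^\"']{8,}[\"']",
        re.IGNORECASE,
    ),
}


def scan_for_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

For notebooks, the same function can be applied to the source of each code cell after parsing the notebook JSON.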

Preferred pattern — secrets injection at runtime:

import boto3

def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager", region_name="us-east-1")
    response = client.get_secret_value(SecretId=secret_name)
    return response["SecretString"]

# At training job start — not hardcoded, not in env
openai_key = get_api_key("ml-training/openai-api-key")

The training job IAM role should have access to exactly the secrets it needs, and no others.

Secret rotation: API keys used in training pipelines should be rotated on a fixed schedule. Use the automatic rotation capabilities of AWS Secrets Manager or HashiCorp Vault where possible.
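Where automatic rotation is not available, an overdue check can at least raise an alert. The sketch below assumes the last-rotated timestamp has already been read from the secrets manager's metadata as a timezone-aware datetime; the function name and the 90-day figure in the usage note are illustrative, not recommendations from any vendor.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rotation_overdue(
    last_rotated: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the secret was last rotated more than max_age ago.

    Both datetimes must be timezone-aware; `now` defaults to the current UTC time.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_rotated > max_age
```

A scheduled job could call this for each secret the ML pipelines use, for example with max_age=timedelta(days=90), and open a ticket for any that are overdue.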

Principle 5: Immutable Audit Logging

All access to training data, model registry operations, and secrets retrievals should be logged to an immutable audit trail. This enables:

Detection of anomalous access (data exfiltration, unauthorised model reads)
Forensic investigation following an incident
Compliance with data governance requirements

Key events to log:

Dataset read operations (who accessed which data, when)
Model registration and promotion events
Secret access events
Training job launches (who triggered them, with what parameters)
Inference requests to production models (for compliance use cases)

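# Structured audit log entry for dataset access
import os

import boto3
import structlog

log = structlog.get_logger()

def get_current_identity() -> str:
    # Assumed helper: ARN of the IAM principal this job is running as
    return boto3.client("sts").get_caller_identity()["Arn"]

# `dataset` is assumed to be the already-loaded training dataset
log.info(
    "dataset_accessed",
    dataset="s3://my-bucket/training-data/v3/",
    job_id=os.environ.get("TRAINING_JOB_ID"),
    identity=get_current_identity(),
    record_count=len(dataset),
)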
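Structured entries alone are not immutable. One way to make tampering detectable is to chain entries by hash, as in the sketch below, where each entry commits to the hash of the one before it. This is an illustrative, in-memory example with hypothetical function names; it gives tamper evidence only, and actual immutability still depends on shipping the log to write-once storage that the workload cannot modify.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)  # stable serialization
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Truncation of the most recent entries is not detectable from the chain alone, so the latest hash should be anchored somewhere outside the writer's control.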

Continuous Compliance Monitoring

Zero trust is not a one-time configuration — it requires ongoing validation. Consider implementing:

IAM policy drift detection — alert when ML workload policies are broadened
Unusual access pattern detection — flag training jobs that access data outside their expected scope
Dependency integrity verification — verify checksums of ML libraries at container build time
Model behavior monitoring — detect distributional shifts in model outputs that may indicate tampering
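The first item, policy drift detection, can start as a plain comparison against a reviewed baseline. The sketch below is a minimal example with hypothetical function names: it flattens each policy's Allow statements into (action, resource) pairs and reports pairs present in the current policy but not in the baseline. The comparison is literal, so it does not reason about wildcard overlap, conditions, or Deny statements.

```python
def _as_list(value):
    """IAM allows a bare string or object where a list is expected."""
    return value if isinstance(value, list) else [value]


def allowed_pairs(policy: dict) -> set[tuple[str, str]]:
    """Flatten Allow statements into a set of (action, resource) pairs."""
    pairs = set()
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            for resource in _as_list(stmt.get("Resource", [])):
                pairs.add((action, resource))
    return pairs


def detect_broadening(baseline: dict, current: dict) -> set[tuple[str, str]]:
    """Pairs granted by the current policy that the baseline did not grant."""
    return allowed_pairs(current) - allowed_pairs(baseline)
```

An empty result means nothing new was granted; any non-empty result is a candidate alert for review.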

Applied consistently, these controls significantly reduce the attack surface of ML infrastructure without materially impeding engineering velocity. The key is automation — manual policy reviews are insufficient at the pace modern ML teams iterate.