AI Security Wire

Zero Trust Architecture for ML Pipelines: A Practitioner Guide

If you’ve ever had a SageMaker training job that could read any bucket in the account (not just the one it needed), you’ve seen the problem firsthand. ML infrastructure accumulates privilege debt quietly. Training jobs run with broad credentials because it was faster to set up that way. Model registries get accessed with whatever token is available. Notebooks inherit the IAM role of the engineer who spun them up, which is usually far too permissive.

The result is infrastructure that’s a significant lateral movement target, with weak audit trails and no effective containment if an attacker or compromised job reaches it.

Why ML Infrastructure Is Over-Privileged by Default

A training job actually needs very little: read access to its input data, write access to its checkpoint location, and maybe a secrets manager call for API keys. What it typically gets is something more like broad S3 access, a secrets role that covers half the account, and IAM permissions that would let it talk to services it has no business touching.

The implicit assumption (“if you can authenticate as the training role, you can do what the training role can do”) is exactly what zero trust is designed to eliminate.

Identity Per Workload, Scoped to the Minimum

Every ML workload (training job, serving instance, notebook, pipeline step) needs its own identity, with access limited to exactly what that workload requires. In AWS:

# Terraform: training job IAM role
resource "aws_iam_role" "training_job" {
  name = "sagemaker-training-job-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Principal = { Service = "sagemaker.amazonaws.com" },
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "training_job_data_access" {
  role = aws_iam_role.training_job.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = ["s3:GetObject"],
        Resource = "arn:aws:s3:::my-training-data/datasets/my-dataset/*"
      },
      {
        Effect = "Allow",
        Action = ["s3:PutObject"],
        Resource = "arn:aws:s3:::my-model-registry/checkpoints/my-model/*"
      }
    ]
  })
}

No s3:*. No bucket-wide or account-wide resource wildcards; the only wildcard is the object key under one specific prefix. The job reads its input dataset and writes to its checkpoint path. That’s it. The Terraform is slightly more verbose. The blast radius of a compromised training job is dramatically smaller.
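To keep policies in this shape over time, a small lint step can help. The following is a minimal sketch, not an official AWS tool: `find_overbroad_statements` is a hypothetical helper, and its definition of "too broad" (a wildcard action, a bare `*` resource, or an S3 ARN that covers a whole bucket) is an assumption you would tune to your own conventions.

```python
from typing import Any

S3_PREFIX = "arn:aws:s3:::"


def _as_list(value: Any) -> list:
    """IAM accepts either a single value or a list for these fields."""
    return value if isinstance(value, list) else [value]


def find_overbroad_statements(policy: dict) -> list[str]:
    """Return one finding per over-broad grant in the policy's Allow statements."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            # "*" or "s3:*" grants every action in at least one service
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in _as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: broad resource {resource!r}")
            elif resource.startswith(S3_PREFIX):
                # Flag ARNs like arn:aws:s3:::* or arn:aws:s3:::bucket/*
                key_pattern = resource[len(S3_PREFIX):]
                if key_pattern.endswith("*") and key_pattern.count("/") <= 1:
                    findings.append(f"statement {index}: broad resource {resource!r}")
    return findings
```

Run against the policy above, this returns an empty list; run against a policy with `s3:*` on `*`, it returns two findings. Wiring it into CI as a failing check is one option, assuming your policies are available as JSON at build time.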

For Kubernetes environments, use Workload Identity (GKE) or IRSA (EKS) to bind a Kubernetes service account to a cloud IAM identity instead of mounting long-lived credentials as secrets. On GKE:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job-sa
  namespace: ml-training
  annotations:
    # GKE Workload Identity: the Google service account this Kubernetes SA acts as
    iam.gke.io/gcp-service-account: "[email protected]"

Network Segmentation: Often Skipped, Always Worth Doing

Training jobs should run in a dedicated subnet with no direct internet egress. Outbound traffic routes through a controlled proxy with allowlist filtering. The reason is straightforward: if a compromised training job tries to exfiltrate data or phone home to C2 infrastructure, you want network controls to block it, not to discover it in logs two weeks later.
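The allowlist decision itself is simple, and it helps to see how strict it should be. Here is a minimal sketch of the check a filtering proxy might apply; `ALLOWED_EGRESS_HOSTS` and `is_egress_allowed` are hypothetical names, and the two hosts are only illustrative. Real enforcement belongs in the proxy or firewall, not in the workload's own code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from reviewed configuration.
ALLOWED_EGRESS_HOSTS = {"pypi.org", "files.pythonhosted.org"}


def is_egress_allowed(url: str, allowed_hosts: set[str] = ALLOWED_EGRESS_HOSTS) -> bool:
    """Default-deny: allow only when the URL's host exactly matches an entry."""
    host = urlsplit(url).hostname  # lower-cased; None when no host can be parsed
    return host is not None and host in allowed_hosts
```

Exact host matching is deliberate. Substring or suffix matching would let a name like `pypi.org.attacker.example` through, which defeats the point of the allowlist.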

Serving infrastructure belongs in a separate network segment from training infrastructure. The serving layer has read-only access to the model registry. It never gets training credentials. Use a service mesh (Istio, Linkerd) with mutual TLS between inference components, because encryption in transit matters even inside your private cloud.

Jupyter notebooks are their own problem. They’re interactive, run by individuals, and often have credentials that reflect what the data scientist needed last time. Give notebooks a separate IAM role, scoped to read-only data access, with per-user identity and session isolation if you can manage the operational overhead.

The Model Registry Is a Critical Asset: Treat It That Way

Model weights are the core AI asset. Whoever can write to your model registry can insert backdoors, degrade capability, or substitute models entirely. This is not a theoretical threat; MITRE ATLAS documents it explicitly.

Minimum controls:

| Control | Implementation |
| --- | --- |
| Authentication | All access requires a valid identity (no anonymous pull) |
| Role separation | Separate roles for write (CI/CD pipeline) and read (serving) |
| Signing | Sign model artifacts with a key managed by a KMS |
| Audit logging | All pull and push events logged to immutable audit log |
| Version immutability | Published model versions cannot be overwritten (append-only) |

That last control (append-only versioning) is what makes retrospective investigation possible. If a model version can be silently overwritten, you lose the ability to verify what was actually running in production.
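To make the append-only rule concrete, here is a minimal in-memory sketch. `AppendOnlyRegistry` is a hypothetical class, not the API of any real registry, and the SHA-256 digest is only a stand-in for a proper KMS-backed signature: it detects changed bytes but does not prove who published them.

```python
import hashlib
import hmac


class AppendOnlyRegistry:
    """In-memory stand-in for a registry that never overwrites a version."""

    def __init__(self) -> None:
        self._digests: dict[tuple[str, str], str] = {}

    def publish(self, name: str, version: str, artifact: bytes) -> str:
        """Record the artifact's digest; refuse if the version already exists."""
        key = (name, version)
        if key in self._digests:
            # Published versions are immutable: reject instead of overwriting
            raise ValueError(f"{name}:{version} is already published")
        digest = hashlib.sha256(artifact).hexdigest()
        self._digests[key] = digest
        return digest

    def verify(self, name: str, version: str, artifact: bytes) -> bool:
        """True only if the bytes match what was originally published."""
        expected = self._digests.get((name, version))
        actual = hashlib.sha256(artifact).hexdigest()
        return expected is not None and hmac.compare_digest(expected, actual)
```

A serving process would call something like `verify` before loading weights, so a substituted artifact fails closed rather than quietly going live.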

Secrets Management: The Common Failure Mode

ML workloads need API keys constantly: model APIs, data providers, evaluation services. The failure modes here are embarrassingly common. API keys in notebooks committed to git. Credentials set as environment variables on the VM, visible to every process. Keys hardcoded in training scripts “temporarily” and never cleaned up.

The right pattern:

import boto3

def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager", region_name="us-east-1")
    response = client.get_secret_value(SecretId=secret_name)
    return response["SecretString"]

# At training job start — not hardcoded, not in env
openai_key = get_api_key("ml-training/openai-api-key")

The training job IAM role has access to exactly the secrets it needs. AWS Secrets Manager and HashiCorp Vault can both rotate secrets on a schedule once rotation is configured; it does not happen by default. If the job is compromised, it can’t reach secrets for other workloads.
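The scoping is what does the work here, so it is worth spelling out. The sketch below builds a policy document that allows reads only under one secret name prefix. `secrets_read_policy` is a hypothetical helper, the account ID is a placeholder, and the `ml-training/` prefix simply mirrors the secret name used above.

```python
import json


def secrets_read_policy(region: str, account_id: str, name_prefix: str) -> dict:
    """Build an IAM policy allowing reads of secrets under one name prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                # Matches ml-training/<anything>, nothing outside that prefix
                "Resource": (
                    f"arn:aws:secretsmanager:{region}:{account_id}"
                    f":secret:{name_prefix}/*"
                ),
            }
        ],
    }


# Placeholder account ID; substitute your own.
policy = secrets_read_policy("us-east-1", "123456789012", "ml-training")
print(json.dumps(policy, indent=2))
```

Attached to the training role, a policy like this means a compromised job can read `ml-training/openai-api-key` but not, say, a serving or billing secret stored under a different prefix.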

Immutable Audit Logging

Log everything meaningful, not as a compliance checkbox, but because you can’t investigate what you didn’t record:

  • Dataset read operations (who accessed which data, when)
  • Model registration and promotion events
  • Secrets access events
  • Training job launches (who triggered them, with what parameters)
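  • Inference requests to production models (for compliance use cases)

# Structured audit log entry for dataset access
import os

import boto3
import structlog

log = structlog.get_logger()


def get_current_identity() -> str:
    # Assumed helper: ARN of the IAM principal this workload is running as
    return boto3.client("sts").get_caller_identity()["Arn"]


# `dataset` is assumed to be the already-loaded training dataset
log.info(
    "dataset_accessed",
    dataset="s3://my-bucket/training-data/v3/",
    job_id=os.environ.get("TRAINING_JOB_ID"),
    identity=get_current_identity(),
    record_count=len(dataset),
)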

Immutable means write-once and shipped to an external destination immediately. Logs that can be modified by the same workload that generated them have little forensic value.
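Write-once storage is the primary control. As a complement, you can make tampering detectable by chaining entries together with hashes. This is a different technique from write-once storage, and the sketch below is only an illustration: `append_entry` and `verify_chain` are hypothetical helpers, and events are assumed to be JSON-serializable dicts.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(event: dict, prev_hash: str) -> str:
    """Hash the event together with the hash of the entry before it."""
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event whose hash covers both the event and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev": prev_hash, "hash": _entry_hash(event, prev_hash)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain-deleted entry fails.

    Truncating the tail is NOT detected without an externally stored latest hash.
    """
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain only helps if the latest hash is stored somewhere the workload cannot write to, which brings you back to the external destination requirement.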

Keeping It Current

Zero trust isn’t a configuration you apply once and forget. ML teams move fast: new models, new datasets, new services. Policy drift happens. Build detection for it:

  • Alert on IAM policy broadening for ML workload roles
  • Flag training jobs that access data outside their expected scope
  • Verify ML library checksums at container build time
  • Monitor model outputs for distributional shifts that might indicate tampering
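The first item on that list is the easiest to automate. The sketch below compares two versions of a policy document and reports any grant that appears only in the newer one. `allowed_pairs` and `broadened_grants` are hypothetical helpers, and the comparison is literal: it does not evaluate wildcards, conditions, or Deny statements, so treat it as a starting point rather than a full policy analyzer.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts either a single value or a list for these fields."""
    return value if isinstance(value, list) else [value]


def allowed_pairs(policy: dict) -> set[tuple[str, str]]:
    """Flatten Allow statements into a set of (action, resource) pairs."""
    pairs = set()
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            for resource in _as_list(statement.get("Resource", [])):
                pairs.add((action, resource))
    return pairs


def broadened_grants(old_policy: dict, new_policy: dict) -> set[tuple[str, str]]:
    """Return grants present in the new policy but absent from the old one."""
    return allowed_pairs(new_policy) - allowed_pairs(old_policy)
```

Any non-empty result is a candidate alert. Someone should be able to explain why the training role suddenly needs the new grant.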

Automation is the only way to keep up. Manual policy reviews at the pace most ML teams iterate will always be behind.

Frequently Asked Questions

What does zero trust mean specifically in the context of ML pipelines?
Zero trust for ML pipelines means every workload (training job, serving instance, notebook, pipeline step) has its own unique scoped identity with access limited to exactly what it needs. No implicit trust is granted based on network location or account membership. Training jobs access only their specific input dataset and output location; serving infrastructure has read-only access to the model registry with no training credentials.

Why is the model registry a particularly critical security control point?
The model registry stores trained model weights, the core AI asset. Without access controls, any authenticated user or process can overwrite or substitute model versions, enabling backdoor insertion or capability degradation. Minimum controls include authentication for all access, role separation between CI/CD write access and serving read access, cryptographic signing of artifacts, immutable versioning, and audit logging of all pull and push events.

How should ML teams handle secrets like API keys and cloud credentials in training pipelines?
Secrets should never be hardcoded in notebooks or training scripts, stored in VM-level environment variables, committed to version control, or passed as plaintext in container environment variables. The preferred pattern is runtime injection from a secrets manager (AWS Secrets Manager, HashiCorp Vault), with the training job's IAM role scoped to access only the specific secrets it needs, combined with scheduled rotation.