Zero Trust Architecture for ML Pipelines: A Practitioner Guide
ML infrastructure has become a high-value target for adversaries, yet most organisations apply far less rigorous access controls to their AI workloads than to their conventional application stacks. Training jobs routinely run with over-privileged cloud credentials. Model registries often have weak access controls, and experiment tracking platforms are frequently deployed without authentication, exposing sensitive data. This guide covers how to apply zero trust principles to the full ML lifecycle.
The Problem: ML Infrastructure Is Over-Privileged by Default
A typical ML training job in a cloud environment needs:
- Read access to training data in object storage
- Write access to model checkpoints
- Access to a secrets manager (for API keys used by the training code)
- Optionally, access to an experiment tracking service
In practice, training jobs are frequently run with credentials that have:
- Broad read/write access to entire S3 buckets or to every GCS bucket in a project
- Access to secrets far beyond what the job needs
- IAM permissions that would allow lateral movement to other services
The implicit trust model is “if you can run the training job, you can do anything the training job’s credentials allow” — the opposite of zero trust.
Principle 1: Identity for Every Workload
Every ML workload — training job, serving instance, notebook, pipeline step — should have its own unique identity, with access scoped to exactly what that workload needs.
In AWS:
# Terraform: training job IAM role
resource "aws_iam_role" "training_job" {
  name = "sagemaker-training-job-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect    = "Allow",
      Principal = { Service = "sagemaker.amazonaws.com" },
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "training_job_data_access" {
  role = aws_iam_role.training_job.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["s3:GetObject"],
        Resource = "arn:aws:s3:::my-training-data/datasets/my-dataset/*"
      },
      {
        Effect   = "Allow",
        Action   = ["s3:PutObject"],
        Resource = "arn:aws:s3:::my-model-registry/checkpoints/my-model/*"
      }
    ]
  })
}
Avoid s3:* or wildcard resource ARNs. Each training job should be able to read exactly its input dataset and write exactly its output location.
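The wildcard rule above lends itself to an automated check in CI. The following is a minimal sketch, assuming the policy is available as a parsed JSON document; the function names (`find_wildcards`, `_is_broad_resource`) and the definition of "broad" are illustrative assumptions, not part of any AWS tooling, and only S3 ARNs are inspected.

```python
from __future__ import annotations

from typing import Any

S3_ARN_PREFIX = "arn:aws:s3:::"


def _as_list(value: Any) -> list:
    """IAM accepts either a single value or a list for Statement, Action and Resource."""
    return value if isinstance(value, list) else [value]


def _is_broad_resource(resource: str) -> bool:
    """True for "*", a wildcard bucket name, or a whole-bucket "bucket/*" pattern."""
    if resource == "*":
        return True
    if not resource.startswith(S3_ARN_PREFIX):
        return False  # assumption: this sketch only inspects S3 ARNs
    bucket, _, key = resource[len(S3_ARN_PREFIX):].partition("/")
    return "*" in bucket or key == "*"


def find_wildcards(policy: dict) -> list[str]:
    """Return one finding per overly broad action or resource in Allow statements."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            # Catches "*", "s3:*" and partial forms such as "s3:Get*".
            if "*" in action:
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            if _is_broad_resource(resource):
                findings.append(f"statement {index}: broad resource {resource!r}")
    return findings


# Example: a policy that grants every S3 action on a whole bucket.
over_privileged = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::my-training-data/*"}
    ],
}
for finding in find_wildcards(over_privileged):
    print(finding)
```

A prefix-scoped resource such as `arn:aws:s3:::my-training-data/datasets/my-dataset/*` passes this check, because the wildcard only covers objects under one dataset prefix.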
In Kubernetes:
Use Workload Identity (GKE) or IRSA (EKS) to bind a Kubernetes service account to a cloud IAM role, rather than mounting credentials as secrets.
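apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job-sa
  namespace: ml-training
  annotations:
    # GKE Workload Identity uses this annotation to map the Kubernetes service
    # account to a Google service account. The workload pool
    # (my-project.svc.id.goog) is configured on the cluster, not here.
    # GSA_NAME is a placeholder for the Google service account name.
    iam.gke.io/gcp-service-account: "GSA_NAME@my-project.iam.gserviceaccount.com"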
Principle 2: Network Segmentation
ML workloads should be isolated at the network level, even within a private cloud environment.
Training subnet isolation:
- Training jobs should run in a dedicated subnet with no direct internet access
- Outbound traffic should route through a managed NAT gateway or proxy with allowlist filtering
- Block outbound access to known data exfiltration targets (paste sites, file-sharing services, tunnelling endpoints)
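The allowlist filtering mentioned above can be sketched as a small host-matching function of the kind an egress proxy might apply. This is an illustration under stated assumptions: the allowlist entries and the name `is_egress_allowed` are hypothetical, and a real deployment would enforce the rule in the proxy or firewall rather than in application code.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from the proxy configuration.
EGRESS_ALLOWLIST = frozenset({
    "s3.us-east-1.amazonaws.com",
    "secretsmanager.us-east-1.amazonaws.com",
    "pypi.org",
})


def is_egress_allowed(url: str, allowlist: frozenset[str] = EGRESS_ALLOWLIST) -> bool:
    """Allow a request only if its host is an allowlisted domain or a subdomain of one.

    The URL must include a scheme (for example "https://"); anything that does
    not parse to a hostname is denied.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    # Match on whole DNS labels so "evilpypi.org" does not match "pypi.org".
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


for candidate in ("https://pypi.org/simple/", "https://pastebin.com/raw/abc"):
    print(candidate, is_egress_allowed(candidate))
```

Default-deny is the important design choice here: anything not explicitly listed is blocked, so the blocklist of exfiltration targets becomes a second layer rather than the primary control. Note that allowlisting an object storage domain as a whole still permits writes to any bucket on that service, so bucket-level restrictions are needed as well.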
Inference endpoint separation:
- Serving infrastructure should be segregated from training infrastructure
- The serving layer should have read-only access to the model registry; it should never have training credentials
- Use a service mesh (Istio, Linkerd) with mutual TLS between inference components
Jupyter notebook isolation:
- Notebooks should not run with training-level credentials
- Use a separate IAM role for interactive notebooks, scoped to read-only data access
- Consider notebook-as-a-service platforms with per-user identity and session isolation
Principle 3: Model Registry Access Controls
The model registry — the store of trained model weights — is often the least protected component of ML infrastructure. A model registry without access controls is equivalent to a source code repository with no authentication.
Minimum controls for a model registry:
| Control | Implementation |
|---|---|
| Authentication | All access requires a valid identity (no anonymous pull) |
| Role separation | Separate roles for write (CI/CD pipeline) and read (serving) |
| Signing | Sign model artifacts with a key managed by a KMS |
| Audit logging | All pull and push events logged to an immutable audit log |
| Version immutability | Published model versions cannot be overwritten (append-only) |
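Two of the controls in the table, version immutability and artifact integrity, can be illustrated with a small in-memory model. The sketch below is not how any particular registry works: `AppendOnlyRegistry` and `VersionExistsError` are hypothetical names, and a plain SHA-256 digest stands in for a real signature. In production the digest would be signed with a KMS-managed key, since a digest alone only detects tampering if the expected value is stored somewhere the attacker cannot modify.

```python
from __future__ import annotations

import hashlib
import hmac


class VersionExistsError(Exception):
    """Raised when a caller tries to overwrite a published model version."""


class AppendOnlyRegistry:
    """Toy registry: versions can be added but never replaced."""

    def __init__(self) -> None:
        # Maps (model name, version) to the artifact bytes.
        self._artifacts: dict[tuple[str, int], bytes] = {}

    def publish(self, name: str, version: int, artifact: bytes) -> str:
        """Store a new version and return its SHA-256 hex digest."""
        key = (name, version)
        if key in self._artifacts:
            raise VersionExistsError(f"{name} v{version} is already published")
        self._artifacts[key] = artifact
        return hashlib.sha256(artifact).hexdigest()

    def pull(self, name: str, version: int, expected_digest: str) -> bytes:
        """Return the artifact only if it matches the digest pinned by the caller."""
        artifact = self._artifacts[(name, version)]
        actual = hashlib.sha256(artifact).hexdigest()
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(actual, expected_digest):
            raise ValueError(f"digest mismatch for {name} v{version}")
        return artifact


registry = AppendOnlyRegistry()
pinned = registry.publish("my-model", 1, b"model-weights")
weights = registry.pull("my-model", 1, expected_digest=pinned)
```

The serving layer would carry the pinned digest in its deployment configuration, so that a swapped artifact fails at load time rather than silently serving traffic.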
For Hugging Face Hub (self-hosted or cloud):
# Require authentication for all model access
import os

from huggingface_hub import login

# Use a token with minimal permissions (read-only for serving)
login(token=os.environ["HF_READ_TOKEN"])
For MLflow Model Registry:
import mlflow

# Set up access control: only the CI/CD service account can register models
mlflow.set_registry_uri("databricks")
# Use Databricks ACLs to restrict who can register to the production stage
Principle 4: Secrets Management
ML workloads frequently need access to API keys (OpenAI, Anthropic, data provider APIs), database credentials, and cloud service tokens. These secrets must not be:
- Hardcoded in notebooks or training scripts
- Stored in environment variables set at the VM level (visible to all processes)
- Committed to version control (even private repositories)
- Passed as plaintext in container environment variables
Preferred pattern — secrets injection at runtime:
import boto3


def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager", region_name="us-east-1")
    response = client.get_secret_value(SecretId=secret_name)
    return response["SecretString"]


# At training job start: not hardcoded, not in env
openai_key = get_api_key("ml-training/openai-api-key")
The training job IAM role should have access to exactly the secrets it needs, and no others.
Secret rotation: API keys used in training pipelines should be rotated on a fixed schedule. Use AWS Secrets Manager or HashiCorp Vault’s automatic rotation capabilities where possible.
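Rotation only helps if long-running jobs pick up the new value. One way to handle this, sketched below, is a small cache that re-fetches a secret after a time-to-live expires. `SecretCache` is a hypothetical helper, and the fetch function is injected so that any retrieval function with a `str -> str` signature can be plugged in; the five-minute default is an assumption, not a recommendation from any vendor.

```python
from __future__ import annotations

import time
from typing import Callable


class SecretCache:
    """Caches secrets in memory and re-fetches them once the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        # Maps secret name to (time fetched, value).
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Missing or stale: go back to the secrets manager.
        value = self._fetch(name)
        self._entries[name] = (now, value)
        return value


# Stand-in fetcher for illustration; a real job would pass its secrets manager lookup.
def fake_fetch(name: str) -> str:
    return f"value-of-{name}"


cache = SecretCache(fake_fetch, ttl_seconds=60.0)
api_key = cache.get("ml-training/openai-api-key")
```

Keeping the TTL shorter than the rotation interval bounds how long a job can keep using a retired key, at the cost of a few extra calls to the secrets manager.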
Principle 5: Immutable Audit Logging
All access to training data, model registry operations, and secrets retrievals should be logged to an immutable audit trail. This enables:
- Detection of anomalous access (data exfiltration, unauthorised model reads)
- Forensic investigation following an incident
- Compliance with data governance requirements
Key events to log:
- Dataset read operations (who accessed which data, when)
- Model registration and promotion events
- Secret access events
- Training job launches (who triggered them, with what parameters)
- Inference requests to production models (for compliance use cases)
# Structured audit log entry for dataset access
import os

import structlog

log = structlog.get_logger()

# get_current_identity() and dataset are assumed to be defined elsewhere in the training code
log.info(
    "dataset_accessed",
    dataset="s3://my-bucket/training-data/v3/",
    job_id=os.environ.get("TRAINING_JOB_ID"),
    identity=get_current_identity(),
    record_count=len(dataset),
)
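Structured entries like the one above are only useful as evidence if they cannot be silently edited. Real immutability comes from the storage layer (for example, write-once retention settings), but tamper-evidence can also be sketched in application code with a hash chain, where each entry commits to the one before it. The helper names `append_event` and `verify_chain` are hypothetical, and events are assumed to be JSON-serialisable dictionaries.

```python
from __future__ import annotations

import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, removed or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_event(audit_log, {"type": "dataset_accessed", "dataset": "s3://my-bucket/training-data/v3/"})
append_event(audit_log, {"type": "secret_accessed", "secret": "ml-training/openai-api-key"})
print(verify_chain(audit_log))
```

An attacker who can rewrite the whole log can also recompute the chain, so the latest hash should be anchored periodically in a separate system that the logging workload cannot write to.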
Continuous Compliance Monitoring
Zero trust is not a one-time configuration — it requires ongoing validation. Consider implementing:
- IAM policy drift detection — alert when ML workload policies are broadened
- Unusual access pattern detection — flag training jobs that access data outside their expected scope
- Dependency integrity verification — verify checksums of ML libraries at container build time
- Model behavior monitoring — detect distributional shifts in model outputs that may indicate tampering
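The first item in the list, policy drift detection, can be sketched as a comparison between a reviewed baseline policy and the policy currently attached to a workload. This is a simplified illustration: `allowed_pairs` and `detect_broadening` are hypothetical names, and the comparison is literal, so it flags any new (action, resource) grant but does not evaluate whether a wildcard semantically covers an existing grant.

```python
from __future__ import annotations

from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts either a single value or a list for Statement, Action and Resource."""
    return value if isinstance(value, list) else [value]


def allowed_pairs(policy: dict) -> set[tuple[str, str]]:
    """Flatten a policy's Allow statements into a set of (action, resource) grants."""
    pairs = set()
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            for resource in _as_list(stmt.get("Resource", [])):
                pairs.add((action, resource))
    return pairs


def detect_broadening(baseline: dict, current: dict) -> set[tuple[str, str]]:
    """Return grants present in the current policy but absent from the baseline."""
    return allowed_pairs(current) - allowed_pairs(baseline)


baseline = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::my-training-data/datasets/my-dataset/*"}
    ]
}
current = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:DeleteObject"],
         "Resource": "arn:aws:s3:::my-training-data/datasets/my-dataset/*"}
    ]
}
for action, resource in sorted(detect_broadening(baseline, current)):
    print(f"new grant: {action} on {resource}")
```

Run on a schedule against the live policies, a check like this turns a broadened role into an alert instead of something discovered during an incident review.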
Applied consistently, these controls significantly reduce the attack surface of ML infrastructure without materially impeding engineering velocity. The key is automation — manual policy reviews are insufficient at the pace modern ML teams iterate.