AI Security Wire

AI Software Bill of Materials: Tracking Model Components

Ask any security engineer which version of a library their application depends on, and they can tell you within seconds. Ask them what training data went into the model they deployed last quarter, and you’ll get a blank stare. That gap, between how we track software and how we track AI components, is where supply chain risk accumulates.

The AI-SBOM concept isn’t new anymore. EU AI Act Article 11 requires technical documentation covering model provenance for high-risk systems. The NIST AI RMF treats risk from third-party models and data as part of its governance baseline. What’s still missing in most organisations is the operational implementation: actually generating, storing, and querying these records in a way that’s useful when you need them.

What a Traditional SBOM Misses

Standard SBOM tooling covers code libraries and dependencies well. Pull in Syft or Trivy, get your package inventory. But an AI system’s component graph looks nothing like a conventional application dependency tree. What standard SBOM tools won’t capture:

| Component | Why It Matters |
| --- | --- |
| Base model (weights + architecture) | Backdoors, biases, and capabilities are properties of the weights |
| Fine-tuning dataset | Dataset provenance affects copyright, PII, and poisoning risk |
| RLHF / alignment data | Determines safety behaviour; manipulation here affects all downstream uses |
| LoRA / adapter weights | Can override base model behaviour; need independent provenance |
| Prompt templates / system prompts | Define application behaviour; versioning and integrity matter |
| Inference framework | Serialisation vulnerabilities, hardware-specific behaviour |
| Embedding model (for RAG) | Affects retrieval; poisoning here affects all downstream queries |

A security incident or a compliance audit will require answering questions about every item on that list. Without an AI-SBOM, you’re improvising under pressure: the worst possible time to start building an inventory.

What the Records Should Actually Contain

CycloneDX with its ML extensions (cdx:ml) is the format to use here: it supports model components, training dataset records, and dependency graphs, and it can carry the provenance detail that both the EU AI Act and NIST ask for. The minimum fields for each component type:

Base Model Record

{
  "type": "machine-learning-model",
  "name": "llama-3-70b-instruct",
  "version": "3.0",
  "purl": "pkg:huggingface/meta-llama/Meta-Llama-3-70B-Instruct@3.0",
  "hashes": [
    { "alg": "SHA-256", "content": "a3f2...b91c" }
  ],
  "supplier": { "name": "Meta AI", "url": "https://ai.meta.com" },
  "licenses": [{ "id": "LLAMA-3-COMMUNITY" }],
  "properties": [
    { "name": "training-compute-flops", "value": "1.8e24" },
    { "name": "training-data-cutoff", "value": "2023-12" },
    { "name": "parameters", "value": "70000000000" }
  ]
}

Fine-Tune / Adapter Record

{
  "type": "machine-learning-model",
  "name": "llama-3-70b-customer-service-lora",
  "version": "1.4.2",
  "hashes": [{ "alg": "SHA-256", "content": "7c4a...e230" }],
  "dependencies": ["pkg:huggingface/meta-llama/Meta-Llama-3-70B-Instruct@3.0"],
  "properties": [
    { "name": "adapter-type", "value": "LoRA" },
    { "name": "training-dataset-id", "value": "ds-customer-service-v3" },
    { "name": "training-date", "value": "2026-03-15" },
    { "name": "trainer", "value": "[email protected]" }
  ]
}

Training Dataset Record

{
  "type": "data",
  "name": "customer-service-training-v3",
  "version": "3.0",
  "hashes": [{ "alg": "SHA-256", "content": "1b9f...4d72" }],
  "properties": [
    { "name": "record-count", "value": "142000" },
    { "name": "pii-assessed", "value": "true" },
    { "name": "pii-assessment-date", "value": "2026-02-28" },
    { "name": "data-sources", "value": "internal-crm,synthetic-generation" },
    { "name": "collection-date-range", "value": "2024-01/2026-01" },
    { "name": "data-controller", "value": "example-corp" }
  ]
}

The hash on the training dataset is worth dwelling on. With a hash on record, you can verify at any future point that the dataset you hold is byte-for-byte the one that was registered at training time. Without it, provenance claims are unverifiable assertions.
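
As a minimal sketch of that verification, assuming the dataset is stored as a single file (an archive or a Parquet export, say) and the record carries the full SHA-256 digest rather than a shortened one. The function names are illustrative, not part of any CycloneDX tooling:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_matches_record(path: Path, record: dict) -> bool:
    """Compare the file's digest with the SHA-256 entries in a dataset record."""
    recorded = {
        h.get("content", "").lower()
        for h in record.get("hashes", [])
        if h.get("alg") == "SHA-256"
    }
    return sha256_of_file(path) in recorded
```

A dataset spread across many files would need a stable manifest (sorted paths plus per-file digests) hashed in the same way; that choice is left open here.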

Generating Records Programmatically

Don’t write these by hand. Generate them at registration time and store them alongside the model artefact; otherwise you’ll have a documentation exercise that falls behind deployment reality within six months.

The cyclonedx-python-lib handles the CycloneDX ML format:

from cyclonedx.model import HashAlgorithm, HashType
from cyclonedx.model.bom import Bom
from cyclonedx.model.component import Component, ComponentType
from packageurl import PackageURL

bom = Bom()

# Describe the base model as a machine-learning-model component.
# Recent cyclonedx-python-lib releases name this parameter `type`;
# older ones used `component_type`.
model_component = Component(
    type=ComponentType.MACHINE_LEARNING_MODEL,
    name='llama-3-70b-instruct',
    version='3.0',
    purl=PackageURL(
        type='huggingface',
        namespace='meta-llama',
        name='Meta-Llama-3-70B-Instruct',
        version='3.0'
    ),
    # Placeholder digest; record the full SHA-256 of the weights here.
    hashes=[HashType(
        alg=HashAlgorithm.SHA_256,
        content='a3f2...b91c'
    )]
)

bom.components.add(model_component)

Wire it into your model registration workflow, not as a separate step that someone runs manually before a deployment, but as a required gate:

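import mlflow


def register_model_with_sbom(model_path: str, sbom: dict, model_name: str):
    """Log the model and its AI-SBOM in one run, then register the model."""
    with mlflow.start_run() as run:
        mlflow.log_artifact(model_path, "model")
        # Store the SBOM next to the model artefact it describes.
        mlflow.log_dict(sbom, "ai-sbom.json")
        # A missing key raises KeyError here, before the model is registered.
        mlflow.set_tags({
            "sbom.version": sbom["version"],
            "sbom.base-model": sbom["components"][0]["name"],
            # metadata.timestamp records when the SBOM was generated.
            "sbom.generated-at": sbom["metadata"]["timestamp"]
        })
        mlflow.register_model(f"runs:/{run.info.run_id}/model", model_name)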
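
To make the gate explicit rather than relying on a `KeyError`, a small validation step can run before any registry call. This is a sketch under assumptions: the checks mirror the fields that `register_model_with_sbom` reads plus a hash on every component, and the function names are hypothetical. Your own gate will likely demand more.

```python
def sbom_gate_problems(sbom: dict) -> list[str]:
    """Return reasons to refuse registration; an empty list means the SBOM passes."""
    problems = []
    if "version" not in sbom:
        problems.append("missing top-level 'version'")
    if "timestamp" not in sbom.get("metadata", {}):
        problems.append("missing 'metadata.timestamp'")
    components = sbom.get("components", [])
    if not components:
        problems.append("no components listed")
    for index, component in enumerate(components):
        label = component.get("name", f"component #{index}")
        if "name" not in component:
            problems.append(f"{label}: missing 'name'")
        if not component.get("hashes"):
            problems.append(f"{label}: no hash recorded")
    return problems


def require_valid_sbom(sbom: dict) -> None:
    """Raise before any registry call if the SBOM is incomplete."""
    problems = sbom_gate_problems(sbom)
    if problems:
        raise ValueError("AI-SBOM rejected: " + "; ".join(problems))
```

Calling `require_valid_sbom(sbom)` as the first line of the registration function means an incomplete record stops the run before anything is logged.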

Where This Actually Pays Off

Incident Response: The Compelling Use Case

A base model vulnerability gets disclosed. Or a training dataset turns out to contain sensitive customer records that weren’t supposed to be there. The question from the CISO’s office lands: “Which of our deployed systems are affected?”

With a properly populated registry, that’s a query:

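def find_deployments_using_model(base_model_purl: str, registry) -> list:
    # `registry` stands for whatever deployment inventory you run. It is assumed
    # to expose all_deployments(), with each deployment carrying its parsed SBOM.
    return [
        deployment for deployment in registry.all_deployments()
        if base_model_purl in deployment.sbom.dependency_graph()
    ]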

Without it, you’re emailing teams asking them to check their deployment configs. On a Sunday. Under time pressure.
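
A self-contained version of that lookup, as a sketch: it assumes each deployment's SBOM is a plain dict in CycloneDX JSON shape, with a top-level `dependencies` array of `ref` / `dependsOn` entries, and that components use their purl as `bom-ref`. The registry here is just a hypothetical name-to-SBOM mapping.

```python
def affected_refs(sbom: dict, target_ref: str) -> set[str]:
    """Every ref in one SBOM that depends on target_ref, directly or transitively."""
    # Invert the graph so we can walk from a component to its dependents.
    dependents: dict[str, set[str]] = {}
    for entry in sbom.get("dependencies", []):
        for dep in entry.get("dependsOn", []):
            dependents.setdefault(dep, set()).add(entry["ref"])
    affected: set[str] = set()
    stack = [target_ref]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in affected:  # also guards against cycles
                affected.add(parent)
                stack.append(parent)
    return affected


def find_affected_deployments(registry: dict[str, dict], target_ref: str) -> list[str]:
    """Names of deployments whose SBOM lists target_ref or anything built on it."""
    hits = []
    for name, sbom in registry.items():
        listed = any(
            target_ref in (c.get("bom-ref"), c.get("purl"))
            for c in sbom.get("components", [])
        )
        if listed or affected_refs(sbom, target_ref):
            hits.append(name)
    return sorted(hits)
```

Walking the graph transitively matters because an adapter built on the affected base model, and an application built on that adapter, are both in scope even though only the adapter names the base model directly.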

Regulatory Compliance

The training dataset records above map directly onto the dataset documentation expected in the EU AI Act’s Article 11 technical documentation, provided they include PII assessment status and data controller identity. The same CycloneDX records can serve as supply chain evidence under the NIST AI RMF.
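
Whether those two fields are present can be checked mechanically. The sketch below assumes the SBOM is a plain dict and reuses the property names from the dataset record above (`pii-assessed`, `data-controller`). It only checks that the fields are filled in, not that their content would satisfy an auditor.

```python
REQUIRED_DATASET_PROPERTIES = ("pii-assessed", "data-controller")


def undocumented_datasets(sbom: dict) -> dict[str, list[str]]:
    """Map each dataset component name to the required properties it lacks."""
    gaps = {}
    for component in sbom.get("components", []):
        if component.get("type") != "data":
            continue  # only dataset records are in scope
        present = {
            prop.get("name")
            for prop in component.get("properties", [])
            if prop.get("value")  # an empty value counts as missing
        }
        missing = [name for name in REQUIRED_DATASET_PROPERTIES if name not in present]
        if missing:
            gaps[component.get("name", "<unnamed>")] = missing
    return gaps
```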

Third-Party Model Intake

Before deploying a model from a third party, require a signed AI-SBOM as part of your vendor intake process. Check that the base model hash matches the published release. Check that training dataset provenance is documented. Flag any dependencies on known-vulnerable model versions before they enter your environment, not after.
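
Two of those intake checks, the hash comparison and the known-vulnerable flag, can be sketched as below. The inputs are assumptions: `published_hashes` is a purl-to-digest mapping you have collected from the publisher's release pages, and `blocked_purls` is your own list of affected versions. Signature verification is deliberately left out, since it depends on the signing scheme the vendor uses.

```python
def intake_findings(
    sbom: dict,
    published_hashes: dict[str, str],
    blocked_purls: set[str],
) -> list[str]:
    """Return intake problems for a vendor SBOM; an empty list means none found."""
    findings = []
    for component in sbom.get("components", []):
        purl = component.get("purl")
        name = component.get("name") or purl or "<unnamed>"
        if purl in blocked_purls:
            findings.append(f"{name}: on the known-vulnerable list")
        expected = published_hashes.get(purl)
        if expected is None:
            continue  # nothing published to compare against
        declared = {
            h.get("content", "").lower()
            for h in component.get("hashes", [])
            if h.get("alg") == "SHA-256"
        }
        if expected.lower() not in declared:
            findings.append(f"{name}: hash does not match the published release")
    return findings
```

This compares the digest the vendor declares with the one the publisher released; hashing the delivered weights yourself and comparing that value is the stronger check.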

Frequently Asked Questions

What components does an AI-SBOM capture that a traditional SBOM does not?
An AI-SBOM extends the traditional code and library inventory to include base model weights and architecture, fine-tuning and RLHF datasets, LoRA and adapter weights, prompt templates, inference framework versions, and embedding models used in RAG pipelines, all of which carry distinct security and compliance risk.

Which standards or formats are recommended for implementing an AI-SBOM?
CycloneDX with its machine learning extensions (cdx:ml) is the leading format for AI-SBOMs. It supports model components, training dataset records, and dependency graphs. EU AI Act Article 11 and the NIST AI RMF both call for model component documentation, and CycloneDX records can carry that information for both.

How can an AI-SBOM be used during incident response?
When a vulnerability is disclosed in a base model or ML library, the AI-SBOM registry can be queried to identify every deployed system that uses the affected component. This replaces manual inventory searches with a programmatic lookup against cryptographically hashed component records.