AI SBOM Generator System Architecture
Overview
The AI SBOM Generator is a configurable system that automatically generates Software Bill of Materials (SBOM) documents for AI models hosted on HuggingFace. The system uses a registry-driven architecture that allows for dynamic field configuration without code changes.
System Architecture
Core Components
AI SBOM Generator
├── Web Interface (FastAPI + HTML Templates)
├── API Layer
│   ├── Generation Endpoints
│   ├── Scoring Endpoints
│   └── Batch Processing
├── Core Generation Engine
│   ├── AIBOMGenerator (generator.py)
│   ├── Enhanced Extractor (enhanced_extractor.py)
│   └── Field Registry Manager (field_registry_manager.py)
├── Configuration Layer
│   ├── Field Registry (field_registry.json)
│   ├── Scoring Configuration
│   └── AIBOM Generation Rules
└── Data Sources
    ├── HuggingFace API
    ├── Model Cards
    ├── Configuration Files
    └── README Content
Key Features
- Registry-Driven Configuration: All fields and scoring rules defined in JSON
- Multi-Strategy Extraction: 6 different extraction methods per field
- Standards Compliance: CycloneDX 1.6 compatible output
- Configurable Scoring: Weighted scoring system with tier-based multipliers
- Automatic Field Discovery: New fields added to registry are automatically processed
- Comprehensive Logging: Detailed extraction and scoring logs for debugging
Process Workflow
1. System Initialization
System Initialization Process:
┌───────────────────┐
│  System Startup   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Load Field     │
│     Registry      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Initialize     │
│ Registry Manager  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   Load Scoring    │
│   Configuration   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Initialize     │
│     Enhanced      │
│     Extractor     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   System Ready    │
└───────────────────┘
Steps:
- Load Field Registry: Read field_registry.json, which contains all field definitions
- Initialize Registry Manager: Create manager instance with loaded configuration
- Load Scoring Configuration: Parse scoring weights, tiers, and category definitions
- Initialize Enhanced Extractor: Create extractor with registry-driven field discovery
- System Ready: All components initialized and ready for SBOM generation
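The startup steps above can be sketched as follows. This is only an illustration: the RegistryManager and Extractor classes, the initialize function, and the "fields" and "scoring_config" keys are hypothetical names chosen for the sketch, not the actual contents of field_registry_manager.py or field_registry.json.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RegistryManager:
    """Hypothetical stand-in for the Field Registry Manager."""
    fields: dict = field(default_factory=dict)
    scoring: dict = field(default_factory=dict)

    @classmethod
    def from_file(cls, path):
        # Load Field Registry: read the JSON file holding all field definitions.
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        # Load Scoring Configuration: the key names below are assumptions.
        return cls(fields=data.get("fields", {}),
                   scoring=data.get("scoring_config", {}))

    def field_names(self):
        return sorted(self.fields)


@dataclass
class Extractor:
    """Hypothetical stand-in for the Enhanced Extractor."""
    manager: RegistryManager

    def discoverable_fields(self):
        # Registry-driven discovery: the field list comes from the registry,
        # so a new registry entry shows up here without a code change.
        return self.manager.field_names()


def initialize(path):
    """Run the startup sequence and return a ready-to-use extractor."""
    manager = RegistryManager.from_file(path)
    return Extractor(manager)
```

Because the extractor only asks the manager for field names, a new entry in the registry file is picked up on the next load.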
2. SBOM Generation Process
SBOM Generation Workflow:
    User Request
          │
          ▼
┌───────────────────┐
│ Validate Model ID │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Fetch Model Info  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Initialize     │
│     Enhanced      │
│     Extractor     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Registry-Driven  │
│    Extraction     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Multi-Strategy   │
│ Field Processing  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Generate AIBOM   │
│     Structure     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Calculate     │
│   Completeness    │
│       Score       │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   Return SBOM +   │
│       Score       │
└───────────────────┘
2.1 Model Information Gathering
Input: HuggingFace model ID (e.g., deepseek-ai/DeepSeek-R1)
Process:
- Validate Model ID: Check format and accessibility
- Fetch Model Info: Retrieve metadata from HuggingFace API
- Download Model Card: Get structured model documentation
- Fetch Configuration Files: Download config.json, tokenizer_config.json
- Extract README Content: Parse model description and documentation
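As a rough illustration of the first step, a format check on the model ID might look like the sketch below. The accepted pattern (an optional owner, then a name, using ASCII letters, digits, '-', '_' and '.') is an assumption made for this example; the source does not state the exact rules, and the accessibility check against the HuggingFace API is not shown.

```python
import re

# Assumed shape: "owner/name" or a bare "name".
MODEL_ID_RE = re.compile(
    r"(?:[A-Za-z0-9][\w.-]*/)?[A-Za-z0-9][\w.-]*",
    re.ASCII,
)

# Configuration files named in the process list above.
CONFIG_FILES = ("config.json", "tokenizer_config.json")


def is_valid_model_id(model_id: str) -> bool:
    """Return True if the ID matches the assumed owner/name format."""
    return bool(MODEL_ID_RE.fullmatch(model_id.strip()))
```

For example, is_valid_model_id("deepseek-ai/DeepSeek-R1") passes this check, while an ID containing spaces or more than one slash does not.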
2.2 Registry-Driven Field Extraction
For each of the 29 registry fields:
Multi-Strategy Field Extraction:
 Field from Registry
          │
          ▼
┌───────────────────┐  Success?
│    Strategy 1:    │────────┐
│  HuggingFace API  │        │
└─────────┬─────────┘        │
          │                  │
          │ Failure          │
          ▼                  │
┌───────────────────┐        │
│    Strategy 2:    │        │
│    Model Card     │        │
└─────────┬─────────┘        │
          │                  │
          │ Failure          │
          ▼                  │
┌───────────────────┐        │
│    Strategy 3:    │        │
│   Config Files    │        │
└─────────┬─────────┘        │
          │                  │
          │ Failure          │
          ▼                  │
┌───────────────────┐        │
│    Strategy 4:    │        │
│   Text Patterns   │        │
└─────────┬─────────┘        │
          │                  │
          │ Failure          │
          ▼                  │
┌───────────────────┐        │
│    Strategy 5:    │        │
│    Intelligent    │        │
│     Inference     │        │
└─────────┬─────────┘        │
          │                  │
          │ Failure          │
          ▼                  │
┌───────────────────┐        │
│    Strategy 6:    │        │
│  Fallback Value   │        │
└─────────┬─────────┘        │
          │                  │
          ▼                  │
┌───────────────────┐        │
│  Store Result &   │◀───────┘
│    Log Outcome    │
└───────────────────┘
Extraction Strategies:
HuggingFace API Extraction
- Direct field mapping from API response
- High confidence, structured data
- Fields: name, author, license, tags, etc.
Model Card Extraction
- Parse structured model card YAML/metadata
- Medium-high confidence
- Fields: limitation, metrics, datasets, etc.
Configuration File Extraction
- Mine technical details from config files
- High confidence for technical fields
- Fields: typeOfModel, hyperparameter, etc.
Text Pattern Extraction
- Regex-based extraction from README content
- Medium confidence, requires validation
- Fields: safetyRiskAssessment, informationAboutTraining, etc.
Intelligent Inference
- Smart defaults based on model characteristics
- Medium confidence, contextual
- Fields: primaryPurpose, domain, etc.
Fallback Values
- Placeholder values when no data available
- Low/no confidence, maintains structure
- Ensures complete SBOM structure
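The fallback chain described above can be sketched as an ordered list of strategy functions tried until one returns a usable value. Everything named here (extract_field, the ctx dictionary layout, the "NOASSERTION" placeholder) is hypothetical and chosen for illustration; the text-pattern and inference strategies are left out to keep the sketch short.

```python
from typing import Any, Optional


def from_api(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("api", {}).get(name)


def from_model_card(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("model_card", {}).get(name)


def from_config(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("config", {}).get(name)


def fallback(name: str, ctx: dict) -> Any:
    # Assumed placeholder value; keeps the SBOM structure complete.
    return "NOASSERTION"


# Ordered as in the diagram: (source name, confidence label, function).
STRATEGIES = [
    ("huggingface_api", "high", from_api),
    ("model_card", "medium-high", from_model_card),
    ("config_files", "high", from_config),
    ("fallback", "none", fallback),
]


def extract_field(name: str, ctx: dict) -> dict:
    """Try each strategy in order and record which one succeeded."""
    for source, confidence, strategy in STRATEGIES:
        try:
            value = strategy(name, ctx)
        except Exception:
            value = None  # a failing strategy just hands over to the next one
        if value not in (None, "", [], {}):
            return {"value": value, "source": source, "confidence": confidence}
    return {"value": None, "source": None, "confidence": "none"}
```

Recording the source and confidence next to each value is what would let a later extraction report state where every field came from.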
2.3 AIBOM Structure Generation
Process:
- Create Base Structure: Initialize CycloneDX 1.6 compliant structure
- Populate Metadata Section: Add extracted metadata fields
- Build Component Section: Create model component with extracted data
- Add Model Card: Include AI-specific model card information
- Generate External References: Add distribution and repository links
- Create Dependencies: Define model dependencies and relationships
- Validate Structure: Ensure CycloneDX compliance
Output Structure:
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "serialNumber": "urn:uuid:...",
  "version": 1,
  "metadata": {
    "timestamp": "...",
    "tools": [...],
    "component": {...},
    "properties": [...]
  },
  "components": [{
    "type": "machine-learning-model",
    "name": "...",
    "modelCard": {...},
    "properties": [...]
  }],
  "externalReferences": [...],
  "dependencies": [...]
}
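A minimal sketch of the "Create Base Structure" step, producing an empty skeleton with the same keys as the output structure above. The function name create_base_aibom is hypothetical, and the contents of metadata.component are an assumption, since the source elides them.

```python
import uuid
from datetime import datetime, timezone


def create_base_aibom(model_name: str) -> dict:
    """Return an empty skeleton that later steps would populate."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tools": [],
            # Assumed shape: the source only shows "component": {...}.
            "component": {"type": "machine-learning-model", "name": model_name},
            "properties": [],
        },
        "components": [{
            "type": "machine-learning-model",
            "name": model_name,
            "modelCard": {},
            "properties": [],
        }],
        "externalReferences": [],
        "dependencies": [],
    }
```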
3. Completeness Scoring Process
Completeness Scoring Process:
┌───────────────────┐
│ Extracted Fields  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Categorize     │
│      Fields       │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Apply Tier     │
│      Weights      │
│ • Critical: 3x    │
│ • Important: 2x   │
│ • Supplement: 1x  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Calculate     │
│  Category Scores  │
│ • Required: 20    │
│ • Metadata: 20    │
│ • Basic: 20       │
│ • ModelCard: 30   │
│ • ExtRefs: 10     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   Sum Weighted    │
│      Scores       │
│    (Max: 100)     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Generate Score   │
│      Report       │
└───────────────────┘
Scoring Algorithm:
- Field Categorization: Group fields by category (required_fields, metadata, etc.)
- Tier Weight Application: Apply multipliers (Critical: 3x, Important: 2x, Supplementary: 1x)
- Category Score Calculation: (Fields Present / Total Fields) × Category Weight
- Final Score: Sum of all category scores (max 100)
Category Weights:
- Required Fields: 20 points
- Metadata: 20 points
- Component Basic: 20 points
- Component Model Card: 30 points
- External References: 10 points
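The source does not spell out how the tier multipliers combine with the category formula. One plausible reading, sketched below, counts each field by its tier multiplier inside the present/total ratio, so a missing critical field costs three times as much as a missing supplementary one. The function names and the category keys other than required_fields and metadata are assumptions.

```python
CATEGORY_WEIGHTS = {
    "required_fields": 20,
    "metadata": 20,
    "component_basic": 20,        # assumed key name
    "component_model_card": 30,   # assumed key name
    "external_references": 10,    # assumed key name
}
TIER_MULTIPLIERS = {"critical": 3, "important": 2, "supplementary": 1}


def category_score(fields: dict, present: set, weight: float) -> float:
    """fields maps field name -> tier; present is the set of extracted names."""
    total = sum(TIER_MULTIPLIERS[tier] for tier in fields.values())
    if total == 0:
        return 0.0
    found = sum(TIER_MULTIPLIERS[tier]
                for name, tier in fields.items() if name in present)
    # Tier-weighted version of (Fields Present / Total Fields) x Category Weight.
    return weight * found / total


def completeness_score(registry: dict, present: set) -> dict:
    """registry maps category -> {field name: tier}."""
    breakdown = {
        category: round(category_score(registry.get(category, {}), present, weight), 2)
        for category, weight in CATEGORY_WEIGHTS.items()
    }
    return {"total": round(sum(breakdown.values()), 2), "breakdown": breakdown}
```

Since the category weights sum to 100, a model with every field present scores 100 under this reading, matching the stated maximum.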
4. Output Generation
Generated Artifacts:
- AIBOM JSON: CycloneDX 1.6 compliant SBOM document
- Completeness Score: Numerical score (0-100) with breakdown
- Field Checklist: Detailed field-by-field analysis
- Extraction Report: Confidence levels and data sources
- Validation Results: Compliance and quality checks
Configuration Management
Field Registry Structure
The system is driven by field_registry.json, which defines:
- Field Definitions: All 29 extractable fields
- Scoring Configuration: Weights, tiers, and categories
- AIBOM Generation Rules: Structure and validation rules
- Extraction Strategies: How each field should be extracted
Dynamic Configuration
Adding New Fields:
- Add field definition to field_registry.json
- System automatically discovers and attempts extraction
- No code changes required
Updating Scoring:
- Modify weights in registry configuration
- Changes take effect immediately
- Consistent scoring across all models
Quality Assurance
Validation Layers
- Input Validation: Model ID format and accessibility
- Extraction Validation: Data type and format checking
- Structure Validation: CycloneDX schema compliance
- Scoring Validation: Mathematical correctness
- Output Validation: JSON schema and completeness
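As an illustration of the structure validation layer, the sketch below checks a handful of top-level properties of a generated document. It is a lightweight sanity check under assumed rules, not full CycloneDX schema validation, and validate_structure is a hypothetical name.

```python
import re

URN_UUID_RE = re.compile(
    r"^urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def validate_structure(bom: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if bom.get("bomFormat") != "CycloneDX":
        problems.append("bomFormat must be 'CycloneDX'")
    if bom.get("specVersion") != "1.6":
        problems.append("specVersion must be '1.6'")
    version = bom.get("version")
    if not isinstance(version, int) or isinstance(version, bool) or version < 1:
        problems.append("version must be an integer >= 1")
    if not URN_UUID_RE.match(str(bom.get("serialNumber", ""))):
        problems.append("serialNumber must look like urn:uuid:<uuid>")
    components = bom.get("components")
    has_model = isinstance(components, list) and any(
        isinstance(c, dict) and c.get("type") == "machine-learning-model"
        for c in components
    )
    if not has_model:
        problems.append("no machine-learning-model component found")
    return problems
```

Returning a list of problems instead of raising on the first one lets a report show every issue in a single pass.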
Error Handling
- Individual Field Failures: Don't stop overall processing
- Graceful Degradation: Fallback to lower-confidence strategies
- Comprehensive Logging: Detailed error tracking and debugging
- Recovery Mechanisms: Automatic retry and alternative approaches
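The error-handling behaviour listed above (isolating individual field failures, retrying, logging) could be sketched roughly as follows. The helper names with_retry and extract_all, the logger name, and the retry count are illustrative assumptions.

```python
import logging
import time

logger = logging.getLogger("aibom.extraction")  # hypothetical logger name


def with_retry(func, attempts=3, delay=0.0):
    """Call func(); retry on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


def extract_all(field_names, extract_one):
    """Extract every field; one failing field never stops the others."""
    results, errors = {}, {}
    for name in field_names:
        try:
            results[name] = with_retry(lambda: extract_one(name))
        except Exception as exc:
            errors[name] = str(exc)
            results[name] = None  # left empty for a fallback value
    return results, errors
```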
Performance Characteristics
Typical Processing Times
- Single Model: 2-5 seconds
- Batch Processing: 10-50 models/minute
- Registry Loading: <1 second
- Field Extraction: 1-3 seconds per model
Scalability Features
- Concurrent Processing: Multiple models processed simultaneously
- Caching: Model metadata and configuration caching
- Rate Limiting: Respectful API usage
- Resource Management: Memory and connection pooling
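A minimal sketch of how concurrent processing and metadata caching might be combined, using only the standard library. Here fetch_metadata is a placeholder for a real network call, and process_batch and the worker count are assumptions for illustration; rate limiting and connection pooling are not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=256)
def fetch_metadata(model_id: str) -> dict:
    # Placeholder for a network call. The cached dict is shared between
    # callers, so it should be treated as read-only.
    return {"id": model_id}


def process_batch(model_ids, worker, max_workers=4):
    """Run worker(model_id) for each ID on a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order; an exception raised by a worker
        # propagates when its result is consumed here.
        return dict(zip(model_ids, pool.map(worker, model_ids)))
```

Threads suit this workload under the assumption that it is dominated by waiting on HTTP responses instead of CPU-bound work.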
Integration Points
APIs
- Generation API: /api/generate - Single-model AI SBOM generation, with download URL
- Generation with Completeness Score Report API: /api/generate-with-report - Generation API with completeness scoring report
- Completeness Score Report Only API: /api/models/{model_id}/score - Get the completeness score for a model without generating an AI SBOM
Data Sources
- HuggingFace Hub: Primary model metadata source
- Model Repositories: Direct file access for configurations
- Model Cards: Structured documentation parsing
Output Formats
- CycloneDX JSON: Primary SBOM format
- Field Reports: Human-readable analysis
- CSV Exports: Batch processing results
- API Responses: Structured JSON for integration
This architecture provides a robust, configurable, and standards-compliant solution for AI model SBOM generation with comprehensive field extraction and scoring capabilities.