Zanzibar-RL Stage 1: Policy Generation (SFT)
A fine-tuned Qwen3.5-9B model that translates natural language access control descriptions into correct Zanzibar policies for DefraDB/SourceHub.
Model Description
This is the Stage 1 SFT checkpoint of the zanzibar-rl project. The model was trained to generate:
- Valid Zanzibar policy YAML (resources, permissions, relations)
- Relationship tuples (actor-to-resource bindings)
- Test assertions (ALLOW/DENY) verifying policy correctness
Training Data
- 3,144 examples across 7 access control domains:
- Agent Permissioning (AI agent scoped DIDs, tool access, compartments)
- Multi-Tenant Compartments (org isolation, cross-tenant audit)
- Document Collaboration (Google Docs-style sharing, delegation)
- Filesystem Hierarchy (directories, inheritance, groups)
- Healthcare (HIPAA, patient/clinician roles, break-glass)
- Supply Chain (shipper/consignee/customs, multi-party)
- Industrial/Edge (IoT, defense, satellite, retail)
- 70 human-curated seed examples expanded via Codex/Gemini coding agents
- 100% validated through acp_core PlaygroundService.Simulate
- ChatML messages format with system/user/assistant roles
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B |
| Method | Full fine-tune (FSDP) |
| Hardware | 2x DGX Spark GB10 (128GB each) |
| Precision | BF16 |
| Epochs | 3 |
| Batch size | 1 per device, 8 gradient accumulation |
| Learning rate | 2e-5 |
| Max sequence length | 2048 |
| Gradient checkpointing | Yes |
| Final train loss | 0.14 |
| Final eval loss | 0.17 |
Evaluation Results (Stage 1)
| Metric | Score |
|---|---|
| Format correct (has policy/relationships/tests blocks) | 100% |
| Valid YAML | 100% |
| acp_core Simulate pass rate | 58% |
| All theorem assertions pass | 58% |
The 42% failures are primarily DID format issues in the theorem parser, not incorrect policies. The model produces semantically reasonable policies with correct relation types and permission expressions.
Intended Use
This model is a component of a larger system for automated Zanzibar policy generation. It is designed to be:
- Used as a foundation for Stage 2 SFT (adding modification and reasoning capabilities)
- Evaluated and harvested through the real
policy-agentharness against DefraDB - Eventually embedded in an agent that manages DefraDB/SourceHub access control
Input/Output Format
Input (user message)
<|schema|>
type document { name: String }
<|/schema|>
<|entities|>
users:
- alice (did:key:z6Mk...)
- bob (did:key:z6Mk...)
<|/entities|>
<|request|>
Alice owns the document. Bob should be able to read it but not edit or delete it.
<|/request|>
Output (assistant message)
<|policy|>
name: doc_sharing
resources:
- name: document
permissions:
- name: read
expr: reader
- name: update
- name: delete
relations:
- name: reader
types:
- actor
<|/policy|>
<|relationships|>
document:project_plan#owner@did:key:z6Mk...
document:project_plan#reader@did:key:z6Mk...
<|/relationships|>
<|tests|>
ALLOW: document:project_plan#read@did:key:z6Mk...
DENY: document:project_plan#update@did:key:z6Mk...
<|/tests|>
Next Steps
- Stage 2 SFT: Train on Layer 2 (policy modification via unified diffs) and Layer 3 (policy reasoning with structured Q&A) — 6,092 total examples
- Stage 3 Agent RL: harness-driven policy-agent trajectory collection and subsequent RL on the real operator substrate
- Quantization: GGUF 4-bit for deployment on consumer hardware
Repository
Training code, dataset, and validation tools: sourcenetwork/zanzibar-rl
Citation
@misc{zanzibar-rl-2026,
title={Zanzibar-RL: Fine-tuning Language Models for Zanzibar Policy Generation},
author={Source Network},
year={2026},
url={https://github.com/sourcenetwork/zanzibar-rl}
}
- Downloads last month
- 84