Zanzibar-RL Stage 1: Policy Generation (SFT)

A fine-tuned Qwen3.5-9B model that translates natural language access control descriptions into correct Zanzibar policies for DefraDB/SourceHub.

Model Description

This is the Stage 1 SFT checkpoint of the zanzibar-rl project. The model was trained to generate:

Valid Zanzibar policy YAML (resources, permissions, relations)
Relationship tuples (actor-to-resource bindings)
Test assertions (ALLOW/DENY) verifying policy correctness

Training Data

3,144 examples across 7 access control domains:
- Agent Permissioning (AI agent scoped DIDs, tool access, compartments)
- Multi-Tenant Compartments (org isolation, cross-tenant audit)
- Document Collaboration (Google Docs-style sharing, delegation)
- Filesystem Hierarchy (directories, inheritance, groups)
- Healthcare (HIPAA, patient/clinician roles, break-glass)
- Supply Chain (shipper/consignee/customs, multi-party)
- Industrial/Edge (IoT, defense, satellite, retail)
70 human-curated seed examples expanded via Codex/Gemini coding agents
100% validated through acp_core PlaygroundService.Simulate
ChatML messages format with system/user/assistant roles

Training Details

Parameter	Value
Base model	Qwen/Qwen3.5-9B
Method	Full fine-tune (FSDP)
Hardware	2x DGX Spark GB10 (128GB each)
Precision	BF16
Epochs	3
Batch size	1 per device, 8 gradient accumulation
Learning rate	2e-5
Max sequence length	2048
Gradient checkpointing	Yes
Final train loss	0.14
Final eval loss	0.17

Evaluation Results (Stage 1)

Metric	Score
Format correct (has policy/relationships/tests blocks)	100%
Valid YAML	100%
acp_core Simulate pass rate	58%
All theorem assertions pass	58%

The 42% failures are primarily DID format issues in the theorem parser, not incorrect policies. The model produces semantically reasonable policies with correct relation types and permission expressions.

Intended Use

This model is a component of a larger system for automated Zanzibar policy generation. It is designed to be:

Used as a foundation for Stage 2 SFT (adding modification and reasoning capabilities)
Evaluated and harvested through the real policy-agent harness against DefraDB
Eventually embedded in an agent that manages DefraDB/SourceHub access control

Input/Output Format

Input (user message)

<|schema|>
type document { name: String }
<|/schema|>

<|entities|>
users:
  - alice (did:key:z6Mk...)
  - bob (did:key:z6Mk...)
<|/entities|>

<|request|>
Alice owns the document. Bob should be able to read it but not edit or delete it.
<|/request|>

Output (assistant message)

<|policy|>
name: doc_sharing
resources:
  - name: document
    permissions:
      - name: read
        expr: reader
      - name: update
      - name: delete
    relations:
      - name: reader
        types:
          - actor
<|/policy|>

<|relationships|>
document:project_plan#owner@did:key:z6Mk...
document:project_plan#reader@did:key:z6Mk...
<|/relationships|>

<|tests|>
ALLOW: document:project_plan#read@did:key:z6Mk...
DENY: document:project_plan#update@did:key:z6Mk...
<|/tests|>

Next Steps

Stage 2 SFT: Train on Layer 2 (policy modification via unified diffs) and Layer 3 (policy reasoning with structured Q&A) — 6,092 total examples
Stage 3 Agent RL: harness-driven policy-agent trajectory collection and subsequent RL on the real operator substrate
Quantization: GGUF 4-bit for deployment on consumer hardware

Repository

Training code, dataset, and validation tools: sourcenetwork/zanzibar-rl

Citation

@misc{zanzibar-rl-2026,
  title={Zanzibar-RL: Fine-tuning Language Models for Zanzibar Policy Generation},
  author={Source Network},
  year={2026},
  url={https://github.com/sourcenetwork/zanzibar-rl}
}

Downloads last month: 84

Safetensors

Model size

9B params

Tensor type

BF16

Model tree for jackzampolin/zanzibar-rl-full-20260424

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Finetuned

(395)

this model