Zanzibar-RL Stage 1: Policy Generation (SFT)

A fine-tuned Qwen3.5-9B model that translates natural language access control descriptions into correct Zanzibar policies for DefraDB/SourceHub.

Model Description

This is the Stage 1 SFT checkpoint of the zanzibar-rl project. The model was trained to generate:

  1. Valid Zanzibar policy YAML (resources, permissions, relations)
  2. Relationship tuples (actor-to-resource bindings)
  3. Test assertions (ALLOW/DENY) verifying policy correctness

Training Data

  • 3,144 examples across 7 access control domains:
    • Agent Permissioning (AI agent scoped DIDs, tool access, compartments)
    • Multi-Tenant Compartments (org isolation, cross-tenant audit)
    • Document Collaboration (Google Docs-style sharing, delegation)
    • Filesystem Hierarchy (directories, inheritance, groups)
    • Healthcare (HIPAA, patient/clinician roles, break-glass)
    • Supply Chain (shipper/consignee/customs, multi-party)
    • Industrial/Edge (IoT, defense, satellite, retail)
  • 70 human-curated seed examples expanded via Codex/Gemini coding agents
  • 100% validated through acp_core PlaygroundService.Simulate
  • ChatML messages format with system/user/assistant roles

Training Details

Parameter Value
Base model Qwen/Qwen3.5-9B
Method Full fine-tune (FSDP)
Hardware 2x DGX Spark GB10 (128GB each)
Precision BF16
Epochs 3
Batch size 1 per device, 8 gradient accumulation
Learning rate 2e-5
Max sequence length 2048
Gradient checkpointing Yes
Final train loss 0.14
Final eval loss 0.17

Evaluation Results (Stage 1)

Metric Score
Format correct (has policy/relationships/tests blocks) 100%
Valid YAML 100%
acp_core Simulate pass rate 58%
All theorem assertions pass 58%

The 42% failures are primarily DID format issues in the theorem parser, not incorrect policies. The model produces semantically reasonable policies with correct relation types and permission expressions.

Intended Use

This model is a component of a larger system for automated Zanzibar policy generation. It is designed to be:

  1. Used as a foundation for Stage 2 SFT (adding modification and reasoning capabilities)
  2. Evaluated and harvested through the real policy-agent harness against DefraDB
  3. Eventually embedded in an agent that manages DefraDB/SourceHub access control

Input/Output Format

Input (user message)

<|schema|>
type document { name: String }
<|/schema|>

<|entities|>
users:
  - alice (did:key:z6Mk...)
  - bob (did:key:z6Mk...)
<|/entities|>

<|request|>
Alice owns the document. Bob should be able to read it but not edit or delete it.
<|/request|>

Output (assistant message)

<|policy|>
name: doc_sharing
resources:
  - name: document
    permissions:
      - name: read
        expr: reader
      - name: update
      - name: delete
    relations:
      - name: reader
        types:
          - actor
<|/policy|>

<|relationships|>
document:project_plan#owner@did:key:z6Mk...
document:project_plan#reader@did:key:z6Mk...
<|/relationships|>

<|tests|>
ALLOW: document:project_plan#read@did:key:z6Mk...
DENY: document:project_plan#update@did:key:z6Mk...
<|/tests|>

Next Steps

  • Stage 2 SFT: Train on Layer 2 (policy modification via unified diffs) and Layer 3 (policy reasoning with structured Q&A) — 6,092 total examples
  • Stage 3 Agent RL: harness-driven policy-agent trajectory collection and subsequent RL on the real operator substrate
  • Quantization: GGUF 4-bit for deployment on consumer hardware

Repository

Training code, dataset, and validation tools: sourcenetwork/zanzibar-rl

Citation

@misc{zanzibar-rl-2026,
  title={Zanzibar-RL: Fine-tuning Language Models for Zanzibar Policy Generation},
  author={Source Network},
  year={2026},
  url={https://github.com/sourcenetwork/zanzibar-rl}
}
Downloads last month
84
Safetensors
Model size
9B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jackzampolin/zanzibar-rl-full-20260424

Finetuned
Qwen/Qwen3.5-9B
Finetuned
(395)
this model