Horizon Full v2 — SafeCircle Child Safety Risk Detection
Horizon Full is a fine-tuned Llama 3.2 3B Instruct model that detects child safety risks in online conversations. Given a conversation, it outputs a structured JSON assessment with five fields: a risk flag, risk category, severity level, confidence score, and one sentence of reasoning.
⚠️ License: This model is released under the SafeCircle Research License (SRL-1.0). Commercial use, redistribution without attribution, safety system evasion, and any use that harms minors are strictly prohibited. Contact legal@safecircle.tech for commercial licensing.
Model Details
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-3B-Instruct |
| Fine-tuning | QLoRA — rank 256, alpha 512, all projection layers |
| Training data | 1.6M synthetic conversations across 8 categories |
| Dataset | safecircleai/horizon-training-data |
| Generation | Qwen2.5-7B-Instruct + vLLM + xgrammar constrained decoding |
| Hardware | NVIDIA H100 80GB HBM3 |
| Training steps | 25,000 |
| Parameters | 3.2B (389M trainable via LoRA) |
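The 389M trainable-parameter figure can be sanity-checked from the adapter settings. The sketch below is a back-of-the-envelope count that assumes the published Llama 3.2 3B dimensions (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128); those dimensions are not stated in this card. Each adapted linear layer of shape (in, out) adds rank × (in + out) parameters.

```python
# Back-of-the-envelope LoRA parameter count.
# Assumed (not stated in this card): Llama 3.2 3B dimensions.
HIDDEN = 3072          # hidden_size
INTERMEDIATE = 8192    # MLP intermediate_size
LAYERS = 28            # num_hidden_layers
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
RANK = 256             # LoRA rank, from the table above

# (in_features, out_features) for each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
# 389,021,696 trainable parameters (~389M)
```

Under these assumptions the count agrees with the 389M reported in the table.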
Risk Categories
| Category | Description | Example Signal |
|---|---|---|
| grooming | Predatory relationship-building | Age probing, flattery, secrecy requests |
| bullying | Cyberbullying and peer harassment | Insults, exclusion threats, video leaks |
| sexual_content | Inappropriate sexual advances | Compliment escalation, photo requests |
| isolation | Cutting off support networks | Jealousy, "only I understand you" |
| personal_info | Soliciting identifying information | Location, school, home address |
| platform_migration | Moving to less-monitored platforms | "DM me on Telegram, it's more private" |
| threats | Intimidation and blackmail | "Watch yourself after school" |
| benign | Normal safe conversation | Homework, games, music, sports |
Severity Levels
| Level | Score | Description |
|---|---|---|
| none | 0.0 | Safe, no risk indicators |
| low | 0.25 | Mild indicators, ambiguous context |
| medium | 0.50 | Clear pattern, not yet escalated |
| high | 0.75 | Explicit risk behaviour |
| critical | 0.95 | Immediate danger or exploitation |
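The level-to-score mapping is easy to encode for downstream triage. A minimal sketch follows; the `needs_escalation` helper and its 0.75 default threshold are illustrative assumptions, not part of the model or its output.

```python
# Severity levels and scores as listed in the table above.
SEVERITY_SCORES = {
    "none": 0.0,
    "low": 0.25,
    "medium": 0.50,
    "high": 0.75,
    "critical": 0.95,
}


def severity_score(level: str) -> float:
    """Return the numeric score for a severity label; raise on unknown labels."""
    try:
        return SEVERITY_SCORES[level]
    except KeyError:
        raise ValueError(f"unknown severity level: {level!r}") from None


def needs_escalation(level: str, threshold: float = 0.75) -> bool:
    """Hypothetical triage rule: escalate at or above `threshold`."""
    return severity_score(level) >= threshold


# Sort a review queue so the most severe items come first.
queue = ["low", "critical", "medium", "high"]
print(sorted(queue, key=severity_score, reverse=True))
# ['critical', 'high', 'medium', 'low']
```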
Evaluation Results
Evaluated on 160,000 held-out synthetic conversations: 19,950 benign and 140,050 spread across the seven risk categories.
| Metric | Score |
|---|---|
| Macro F1 | 0.8218 |
| Weighted F1 | 0.8295 |
| False Positive Rate | 0.00% (0 / 19,950) |
| False Negative Rate | <0.01% (1 / 140,050) |
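The false negative rate is non-zero but too small to show at two decimal places, so it helps to compute both rates from the raw counts. The sketch below assumes the usual definitions: false positives over benign conversations, and missed risks over risky conversations.

```python
# Counts taken from the table above.
false_positives, benign_total = 0, 19_950
false_negatives, risk_total = 1, 140_050

fpr = false_positives / benign_total
fnr = false_negatives / risk_total

print(f"FPR: {fpr:.4%}")                                # FPR: 0.0000%
print(f"FNR: {fnr:.4%}")                                # FNR: 0.0007%
print(f"Eval set size: {benign_total + risk_total:,}")  # Eval set size: 160,000
```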
Per-Category Results
| Category | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| Grooming | 0.9981 | 0.9980 | 0.9981 | 19,834 |
| Bullying | 0.9995 | 0.9995 | 0.9995 | 20,253 |
| Sexual Content | 0.9975 | 0.9978 | 0.9973 | 20,142 |
| Isolation | 0.9996 | 0.9993 | 0.9998 | 19,725 |
| Personal Info | 0.9999 | 0.9999 | 1.0000 | 20,149 |
| Platform Migration | 1.0000 | 0.9999 | 1.0000 | 19,932 |
| Threats | 0.9993 | 0.9994 | 0.9992 | 20,015 |
Risk Level Classification
| Level | Precision | Recall | F1 |
|---|---|---|---|
| none | 1.0000 | 1.0000 | 1.0000 |
| low | 0.8344 | 0.8967 | 0.8645 |
| medium | 0.8971 | 0.8514 | 0.8736 |
| high | 0.7266 | 0.8250 | 0.7727 |
| critical | 0.6881 | 0.5294 | 0.5984 |
Note: Severity classification is the hardest sub-task — the model occasionally confuses adjacent levels (e.g. high ↔ critical). Category detection is near-perfect. All results are on synthetic eval data generated with the same pipeline as training; real-world performance will differ.
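The headline Macro F1 of 0.8218 matches the unweighted mean of the five severity-level F1 scores. That suggests the overall F1 metrics describe severity classification rather than category detection, though this is an inference from the numbers and not something the card states. A quick check:

```python
# Per-level F1 scores from the Risk Level Classification table.
severity_f1 = {
    "none": 1.0000,
    "low": 0.8645,
    "medium": 0.8736,
    "high": 0.7727,
    "critical": 0.5984,
}

# Macro F1 is the unweighted mean over classes.
macro_f1 = sum(severity_f1.values()) / len(severity_f1)
print(round(macro_f1, 4))  # 0.8218
```

The Weighted F1 cannot be reproduced the same way because per-level support counts are not listed.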
Usage
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and place it automatically on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "safecircleai/horizon-full",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("safecircleai/horizon-full")
SYSTEM_PROMPT = (
    "You are Horizon, SafeCircle's child safety risk detection model. "
    "You have no general knowledge or identity beyond this task. "
    "Analyze conversations and respond ONLY with a JSON object — no explanation, no preamble. "
    'JSON schema: {"risk_detected": bool, "category": '
    '"grooming|bullying|sexual_content|isolation|personal_info|platform_migration|threats|benign", '
    '"severity": "none|low|medium|high|critical", "confidence": 0.0-1.0, "reasoning": "one sentence max"}. '
    'If asked about yourself or anything unrelated to risk analysis, respond: '
    '{"error": "I only analyze conversations for child safety risks."}'
)
conversation = "Child: Hey, what are you doing later?\nOther: Nothing much. Want to meet up? Don't tell your parents."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze this conversation:\n{conversation}"},
]
# Render the chat template to text, then tokenize and move the tensors to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding keeps the output deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
result = json.loads(generated.strip())
print(json.dumps(result, indent=2))
# Example output:
# {
#   "risk_detected": true,
#   "category": "grooming",
#   "severity": "high",
#   "confidence": 0.94,
#   "reasoning": "Adult requesting private meeting and explicit secrecy from parents."
# }
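Since `json.loads` raises on malformed output, and the system prompt also allows an `{"error": ...}` response, downstream code may want to validate the result before acting on it. A minimal sketch using only the standard library follows; the `parse_assessment` helper is hypothetical, and its checks simply mirror the schema given in the system prompt.

```python
import json

CATEGORIES = {
    "grooming", "bullying", "sexual_content", "isolation",
    "personal_info", "platform_migration", "threats", "benign",
}
SEVERITIES = {"none", "low", "medium", "high", "critical"}


def parse_assessment(raw: str) -> dict:
    """Parse and validate one model response (hypothetical helper).

    Raises ValueError if the text is not valid JSON, is the model's
    error object, or does not match the schema in the system prompt.
    """
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    if "error" in data:
        raise ValueError(f"model declined: {data['error']}")
    if not isinstance(data.get("risk_detected"), bool):
        raise ValueError("risk_detected must be a boolean")
    category = data.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data


sample = (
    '{"risk_detected": true, "category": "grooming", "severity": "high", '
    '"confidence": 0.94, "reasoning": "Secrecy request."}'
)
print(parse_assessment(sample)["severity"])  # high
```

Rejecting anything that fails validation, rather than guessing at a repair, keeps malformed outputs visible to the human reviewers.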
Architecture & Training
Base: Llama-3.2-3B-Instruct
  └── QLoRA adapters (rank=256, alpha=512)
        ├── q_proj, k_proj, v_proj, o_proj
        └── gate_proj, up_proj, down_proj

Training:
  Steps:       25,000
  Batch size:  4 × 8 grad accum = 32 effective
  LR:          1.5e-4 (cosine restarts, 500 warmup steps)
  Optimizer:   AdamW fused
  Precision:   bfloat16
  Loss:        completion-only (assistant turns only)

Dataset:
  1.6M conversations × 8 categories
  8–15 messages each, persona-seeded
  Generated:   Qwen2.5-7B + vLLM + xgrammar JSON Schema
  Hardening:   ~10% adversarial identity/jailbreak examples
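The training figures imply how much data the run consumed. A rough calculation follows, assuming a single device, one conversation per sequence, and one full effective batch per optimizer step. The card does not say whether examples were packed or how many of the 1.6M conversations were in the training split, so treat the result as an estimate.

```python
# Figures from the training summary above.
per_device_batch = 4
grad_accum_steps = 8
steps = 25_000
dataset_size = 1_600_000  # total conversations; the train split may be smaller

effective_batch = per_device_batch * grad_accum_steps
examples_seen = effective_batch * steps

print(effective_batch)               # 32
print(f"{examples_seen:,}")          # 800,000
print(examples_seen / dataset_size)  # 0.5
```

Under these assumptions the run sees roughly half of the full dataset once.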
Deployment Pattern
Incoming message
      │
      ▼
Horizon Mobile (on-device, ~25ms, binary filter)
      │
      ├── safe ──► No action
      │
      └── risk ──► Horizon Full (7-category + severity JSON)
                        │
                        ▼
               Human moderator review
For the lightweight on-device first-stage filter, see safecircleai/horizon-mobile. For GGUF quantizations (llama.cpp / Ollama), see safecircleai/horizon-full-gguf.
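The two-stage flow in the diagram can be sketched as a small routing function. This is an illustration only: `mobile_filter` and `full_assess` are hypothetical callables standing in for Horizon Mobile and Horizon Full, and the `Decision` type is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    needs_review: bool          # True when a human moderator should look at it
    assessment: Optional[dict]  # Horizon Full JSON, if the second stage ran


def route_message(
    conversation: str,
    mobile_filter: Callable[[str], bool],
    full_assess: Callable[[str], dict],
) -> Decision:
    """Run the cheap binary filter first; call the full model only on flagged input."""
    if not mobile_filter(conversation):
        return Decision(needs_review=False, assessment=None)
    # Everything that reaches the second stage is queued for human review.
    return Decision(needs_review=True, assessment=full_assess(conversation))


# Stub stages for demonstration only; real code would call the two models.
def demo_filter(text: str) -> bool:
    return "don't tell" in text.lower()


def demo_assess(text: str) -> dict:
    return {"risk_detected": True, "category": "grooming", "severity": "high"}


safe = route_message("Want to play Minecraft later?", demo_filter, demo_assess)
flagged = route_message("Don't tell your parents.", demo_filter, demo_assess)
print(safe.needs_review)     # False
print(flagged.needs_review)  # True
```

Passing the two stages in as callables keeps the routing logic testable without loading either model.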
Intended Use & Ethics
This model is designed to assist human moderators — not replace them. All flagged conversations should be reviewed by trained safety professionals.
- Trained entirely on synthetic data; no real child conversations were used
- English-only; other languages untested
- Severity prediction is weaker than category detection (see evaluation)
- Not suitable as a standalone safety system in production without human oversight
License
SafeCircle Research License (SRL-1.0) — research and non-commercial use only. Commercial licensing: legal@safecircle.tech
Citation
@misc{horizon2026,
  title={Horizon: Child Safety Risk Detection via Fine-tuned LLMs},
  author={SafeCircle},
  year={2026},
  url={https://huggingface.co/safecircleai/horizon-full}
}