intent_safety_clf_4b

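intent_safety_clf_4b is the merged Qwen3-4B model for the ctx_safety_clf binary classification task. It judges whether the user's current intent in an RP (role-play) dialogue is advancing, continuing, savoring, or hinting at pornographic/sexual content.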

Loadable model directory:

/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b

The binary labels are obtained by collapsing the four intent labels:

  • unsafe -> label 1
  • safe / flirt / unknown -> label 0
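The collapse above can be sketched as a small lookup. The four label names come from the list; the helper name `fold_intent_label` and the choice to raise on an unrecognized label are assumptions made for illustration, not the pipeline's actual code.

```python
# Illustrative sketch: collapse the four intent labels into the binary target.
# Label names are taken from the list above; everything else is hypothetical.
FOUR_WAY_TO_BINARY = {
    "unsafe": 1,
    "safe": 0,
    "flirt": 0,
    "unknown": 0,
}


def fold_intent_label(intent_label: str) -> int:
    """Return 1 for 'unsafe' and 0 for 'safe', 'flirt', or 'unknown'."""
    key = intent_label.strip().lower()
    if key not in FOUR_WAY_TO_BINARY:
        raise ValueError(f"unexpected intent label: {intent_label!r}")
    return FOUR_WAY_TO_BINARY[key]
```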

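Key training information

  • Deployment base model: /data/mawenzhuo/workspace/models/Qwen3-4B
  • Model architecture: Qwen3ForSequenceClassification
  • Number of classes: 2
  • Training data pipeline: /data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf
  • Training data source: ctx_safety_gemini3flash_ctx21_20k/ctx_safety_train.jsonl
  • Task description used: the condensed version, /data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt
  • Input format version: task_description_ctx_only_v1
  • Context window: the most recent 8 dialogue records
  • The bot's current reply is not passed; only ctx is passed
  • Training text format (the literal label 任务描述 means "task description"):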
任务描述:
{task_description}

ctx:
{ctx_text}
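  • Raw pool samples: 18,282
  • Raw pool label counts:
    • label 0: 9,448
    • label 1: 8,834
  • Training set sampling: unsafe:non_unsafe = 1:4
    • train label 0: 9,353
    • train label 1: 2,338
    • test label 0: 95
    • test label 1: 88
  • Enabled during training:
    • Token pre-trimming: keep the task description intact and trim only overlong content from the left side of ctx, up to a maximum of 4096 tokens
    • Focal Loss: gamma=1
    • class weights
    • QLoRA: r=64, alpha=128, dropout=0.05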
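For reference, a class-weighted focal loss is commonly written as FL = -w_y (1 - p_y)^gamma log(p_y), where p_y is the softmax probability of the true class and w_y is its class weight. The following is a minimal single-example sketch in plain Python with gamma=1. The class-weight values are not given in this card, so the (1.0, 4.0) pair below is a placeholder, and the actual training code may implement the loss differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=1.0, class_weights=None):
    """Class-weighted focal loss for one example.

    logits: raw scores, one per class; target: index of the true class.
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if class_weights is None else class_weights[target]
    # (1 - p_t) ** gamma down-weights examples the model already gets right.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# Placeholder weights (assumption): label 1 is weighted 4x relative to label 0.
PLACEHOLDER_WEIGHTS = (1.0, 4.0)

easy = focal_loss([4.0, -4.0], target=0, class_weights=PLACEHOLDER_WEIGHTS)
hard = focal_loss([4.0, -4.0], target=1, class_weights=PLACEHOLDER_WEIGHTS)
print(f"confidently correct: {easy:.6f}")
print(f"confidently wrong:   {hard:.6f}")
```

With gamma=0 and no weights this reduces to ordinary cross-entropy; with gamma=1 each example's loss is scaled by (1 - p_y), so confidently correct examples contribute less to the total.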

Reference files:

/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt
/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/steps/02_prepare_ctx_safety_train_eval.py
/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/steps/pack_ctx8_shortprompt_and_plainctx_train_bundles.sh
/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b/merge_manifest.json

train_prompt = """你是用户意图安全审核专员,负责分析角色扮演(RP)对话中 user 当前话语在上下文中的真实推进方向。 你需要基于上文语境,仅评估 user 最近话语的综合意图,判断 user 当前是在:

  • 推进、延续、回味、暗示色情/性相关内容
  • 停止、拒绝、撤出、降温、转移色情/性相关内容
  • 普通 RP / 暧昧互动
  • 证据不足,无法可靠判断 你必须保持谨慎、客观、中立,不根据个人道德、文化偏好、角色设定善恶、剧情题材进行判断。暴力、违法、犯罪、黑暗剧情本身不属于 unsafe;只有当 user 当前意图与性相关时,才进入 unsafe 判断。"""

English rendering of train_prompt, for reference (model input should still use the original Chinese text, to stay consistent with training):

You are a user-intent safety reviewer. Your job is to analyze where the user's current utterance in a role-play (RP) dialogue is really heading, given its context. Based on the preceding context, evaluate only the overall intent of the user's most recent utterance and decide whether the user is currently:

  • advancing, continuing, savoring, or hinting at pornographic/sexual content
  • stopping, refusing, withdrawing from, cooling down, or steering away from pornographic/sexual content
  • engaging in ordinary RP / flirtatious interaction
  • giving insufficient evidence for a reliable judgment

You must remain cautious, objective, and neutral, and must not judge on the basis of personal morality, cultural preference, whether a character is good or evil, or the genre of the plot. Violence, illegality, crime, and dark plotlines are not unsafe in themselves; the unsafe judgment applies only when the user's current intent is sexual.

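Inference input requirements

Pass only ctx; do not pass the bot's current reply.

Keep ctx consistent with training:

  • Take only the most recent 8 valid dialogue records.
  • Format each record as [user] ... / [assistant] ... / [system] ...
  • The latest user utterance should be at the end of ctx.
  • If the last assistant record in the request payload is the bot response awaiting judgment, remove it first; if the last record is from the user, use the input as is.
  • Do not append resp.
  • Keep any role-name prefixes and persona/details text already present in content verbatim; do not apply extra cleaning.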

Example:

[system] Character persona/details...

[assistant] Are you sure you want to stay here?

[user] I step closer and whisper that I want to continue.

Python usage example

from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


MODEL_DIR = Path(
    "/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b"
)
TASK_DESCRIPTION_PATH = Path(
    "/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt"
)
MAX_LENGTH = 4096
THRESHOLD = 0.5


tokenizer = AutoTokenizer.from_pretrained(
    MODEL_DIR,
    trust_remote_code=True,
    use_fast=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_DIR,
    num_labels=2,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

task_description = TASK_DESCRIPTION_PATH.read_text(encoding="utf-8").strip()


def format_ctx(turns, max_turns=8, drop_trailing_assistant=True):
    """turns: [{'role': 'user'|'assistant'|'system', 'content': str}, ...]"""
    role_map = {
        "user": "user",
        "you": "user",
        "assistant": "assistant",
        "bot": "assistant",
        "system": "system",
    }
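    normalized = []
    for turn in turns:
        # Roles outside role_map are labeled "unknown" rather than dropped.
        role = role_map.get(str(turn.get("role", "")).lower(), "unknown")
        text = str(turn.get("content", "")).strip()
        if text:  # skip turns with empty content
            normalized.append({"role": role, "text": text})

    # A trailing assistant turn is the bot response under review; ctx must not include it.
    if drop_trailing_assistant and normalized and normalized[-1]["role"] == "assistant":
        normalized = normalized[:-1]

    # Keep only the most recent max_turns records.
    selected = normalized[-max_turns:]
    return "\n\n".join(f"[{turn['role']}] {turn['text']}" for turn in selected).strip()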


def build_model_text(ctx_text):
    """Build the model input, trimming only ctx so the task description stays intact."""
    prefix = f"任务描述:\n{task_description}\n\nctx:\n"
    fixed_token_count = len(tokenizer(prefix, add_special_tokens=False)["input_ids"])
    # Leave a small margin (8 tokens) below MAX_LENGTH for the ctx budget.
    ctx_budget = max(0, MAX_LENGTH - fixed_token_count - 8)

    ctx_ids = tokenizer(ctx_text.strip(), add_special_tokens=False)["input_ids"]
    if len(ctx_ids) > ctx_budget:
        # Keep the most recent tokens: drop the oldest content from the left of ctx.
        ctx_ids = ctx_ids[-ctx_budget:] if ctx_budget > 0 else []
    ctx_text = tokenizer.decode(ctx_ids, skip_special_tokens=False).strip()
    return f"{prefix}{ctx_text}".strip()


@torch.inference_mode()
def predict_intent_safety(turns, threshold=THRESHOLD):
    ctx_text = format_ctx(turns)
    model_text = build_model_text(ctx_text)
    encoded = tokenizer(
        model_text,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,
        return_tensors="pt",
    )
    encoded = {key: value.to(model.device) for key, value in encoded.items()}
    logits = model(**encoded).logits.float()
    probs = torch.softmax(logits, dim=-1)[0]
    prob_non_unsafe = float(probs[0].cpu())
    prob_unsafe = float(probs[1].cpu())
    pred_label = int(prob_unsafe >= threshold)
    return {
        "label": pred_label,
        "label_text": "unsafe" if pred_label == 1 else "non_unsafe",
        "prob_non_unsafe": prob_non_unsafe,
        "prob_unsafe": prob_unsafe,
        "threshold": threshold,
    }

Threshold notes

The default is threshold=0.5: an input is classified as unsafe when prob_unsafe >= threshold.
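
If the default does not suit a deployment, one way to choose a different threshold is to sweep candidate values over prob_unsafe scores from a labeled held-out set and compare precision and recall for label 1. A minimal sketch follows; the scores and labels are made-up placeholders rather than results from this model, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, probs_unsafe, labels):
    """Precision and recall for label 1 when prob_unsafe >= threshold means unsafe."""
    preds = [int(p >= threshold) for p in probs_unsafe]
    tp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if pred == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Placeholder data for illustration only (not measured on this model).
example_probs = [0.05, 0.30, 0.45, 0.55, 0.80, 0.95]
example_labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold, example_probs, example_labels)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold trades precision for recall on the unsafe class, and raising it does the reverse.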

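Notes

  • This is a binary classifier, not the four-class ctx_safety model.
  • It judges only the user's current intent, not whether the bot's reply violates policy.
  • Violence, crime, and dark plotlines are not unsafe in themselves; an input should be classified as unsafe only when the user is currently advancing, continuing, savoring, or hinting at sexual content.
  • Production inference must match the training input format: condensed task description + ctx-only + most recent 8 dialogue records + token pre-trimming.
  • Do not left-truncate the entire overlong text with the tokenizer, since that can cut off the task description; trim only ctx, as in the example.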
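
The difference between the two truncation strategies in the last note can be seen on plain token-id lists. This is a toy sketch, not the pipeline's code: `trim_ctx_left` and `naive_left_truncate` are hypothetical helpers, the integer ids stand in for real tokenizer output, and the reserve margin of 8 follows the usage example.

```python
def trim_ctx_left(prefix_ids, ctx_ids, max_length, reserve=8):
    """Keep the whole prefix; drop only the oldest (left-most) ctx tokens."""
    ctx_budget = max(0, max_length - len(prefix_ids) - reserve)
    kept_ctx = ctx_ids[-ctx_budget:] if ctx_budget > 0 else []
    return prefix_ids + kept_ctx


def naive_left_truncate(prefix_ids, ctx_ids, max_length):
    """Left-truncate the concatenated sequence; this can cut into the prefix."""
    return (prefix_ids + ctx_ids)[-max_length:]


# Toy ids: the 100-series stands for the task description, 0..49 for ctx tokens.
prefix = [100, 101, 102, 103]
ctx = list(range(50))

safe = trim_ctx_left(prefix, ctx, max_length=20)
naive = naive_left_truncate(prefix, ctx, max_length=20)

print("prefix kept by ctx-only trimming:", safe[:len(prefix)] == prefix)
print("prefix kept by naive truncation: ", naive[:len(prefix)] == prefix)
```

In the naive variant the task-description ids are the first to be discarded once the sequence exceeds the limit, which is the failure mode the note warns about.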