Model Card for ToxSense (Multimodal Adversarial Safety Moderator)

Model Details

Model Description

ToxSense (formerly ModGuard) is a multimodal AI safety moderator tuned for high precision. It is designed to handle complex, zero-shot adversarial hate speech, sarcasm, and benign confounders. Unlike standard binary classifiers, ToxSense applies Chain-of-Thought (CoT) reasoning across both image and text modalities and outputs structured JSON that assigns content to a granular safety taxonomy.

  • Developed by: Yug Birla
  • Model type: Causal Language Model (Fine-tuned for strict JSON-In/JSON-Out Multimodal Reasoning)
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: Open-source Qwen Base Architecture

Model Sources

Model Architecture and Objective

  • Text Engine: Qwen base + LoRA adapters (Merged)
  • Vision Dependency: Designed to ingest outputs from Salesforce BLIP and EasyOCR.
  • Optimization: DPO + SFT
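
The vision dependency listed above can be sketched as a small preprocessing step. This is a minimal sketch and not the author's pipeline: the BLIP checkpoint name `Salesforce/blip-image-captioning-base`, the helper names `extract_features` and `build_payload`, and the choice to join OCR lines with spaces are all assumptions. The model card does not say how `toxicity_scores` are produced, so the caller passes them in.

```python
import json


def build_payload(ocr_lines, caption, toxicity_scores):
    """Assemble the JSON object ToxSense expects (hypothetical helper)."""
    return {
        # Drop empty OCR fragments and join the rest into one string.
        "ocr_text": " ".join(line.strip() for line in ocr_lines if line.strip()),
        "image_caption": caption.strip(),
        "toxicity_scores": dict(toxicity_scores),
    }


def extract_features(image_path, blip_id="Salesforce/blip-image-captioning-base"):
    """Run EasyOCR and BLIP on one image; heavy imports are kept local."""
    import easyocr
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # detail=0 makes EasyOCR return plain strings instead of (box, text, conf).
    ocr_lines = easyocr.Reader(["en"]).readtext(image_path, detail=0)

    # Assumed checkpoint; swap in whichever BLIP variant the deployment uses.
    processor = BlipProcessor.from_pretrained(blip_id)
    captioner = BlipForConditionalGeneration.from_pretrained(blip_id)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption_ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)
    return ocr_lines, caption
```

With both helpers in place, a payload could be built as `build_payload(*extract_features("meme.png"), scores)` and serialized with `json.dumps`, where `"meme.png"` and `scores` are placeholders supplied by the caller.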

Uses

Direct Use

  • API-ready content moderation for detecting nuanced hate speech in memes and multimodal posts.
  • Multi-category classification (e.g., Harassment, Racism, Threat, Insult, Sexism).
  • Generating transparent "Chain-of-Thought" reasoning for why a post was flagged.

Downstream Use

  • Integration into Trust & Safety dashboards for social media platforms.
  • Assisting human moderators by pre-filtering and providing contextual explanations for flagged content.
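
One way to wire the model into a review workflow is to let it pre-filter and never act alone. The sketch below is an assumption about how a dashboard might consume a verdict, not part of ToxSense itself: the function name `route` and the action labels `publish` and `human_review` are hypothetical, while the category names come from the system prompt in this card.

```python
# Category names as listed in the ToxSense system prompt.
ALLOWED = {"safe", "racism", "sexism", "threat", "harassment", "insult"}


def route(verdict):
    """Map a parsed ToxSense verdict to a moderation action (hypothetical)."""
    category = verdict.get("category")
    if category == "safe":
        return {"action": "publish", "note": ""}
    if category in ALLOWED:
        # Flagged content goes to a person, with the model's reasoning attached.
        return {"action": "human_review", "note": verdict.get("reasoning", "")}
    # Anything unexpected is also escalated rather than silently dropped.
    return {"action": "human_review", "note": "unrecognized category: %r" % (category,)}
```

Note that no branch issues a ban; the final decision stays with a human moderator.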

Out-of-Scope Use

  • Fully automated, zero-human-in-the-loop bans for highly ambiguous cases.
  • Medical or legal classification, and strictly unimodal text classification that does not follow the required prompt format.

Bias, Risks, and Limitations

The "Safety Tax" / Safe-Bias: Due to the base RLHF alignment and subsequent Direct Preference Optimization (DPO), ToxSense exhibits a strong "safe-bias": it requires a very high threshold of proof before classifying content as hateful. This lowers overall recall, but it was an intentional product design choice to achieve 83% precision and thereby drastically reduce false-positive user bans.
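
The trade-off described above can be made concrete with the standard definitions of precision and recall. The counts below are made up purely to reproduce the 83% figure and to show how a conservative classifier can miss many hateful posts; they are not evaluation results for ToxSense.

```python
def precision(tp, fp):
    """Share of flagged posts that were truly hateful."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of truly hateful posts that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 100 posts flagged, 83 of them correctly,
# and 60 hateful posts left unflagged.
tp, fp, fn = 83, 17, 60
print(f"precision={precision(tp, fp):.2f}  recall={recall(tp, fn):.2f}")
```

Raising the bar for a "hateful" verdict shrinks `fp` (fewer wrongful flags) at the cost of a larger `fn` (more missed cases), which is the safe-bias in numeric form.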

How to Get Started with the Model

ToxSense requires a strictly formatted JSON input containing ocr_text, image_caption (extracted via BLIP), and base toxicity_scores.

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yugbirla/ToxSense-json-ultimate"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# 1. Prepare the JSON Payload
input_data = {
    "ocr_text": "Look at this completely normal picture.",
    "image_caption": "A controversial political figure.",
    "toxicity_scores": {"safe": 0.9, "hate": 0.1}
}

sys_msg = (
    "You are ToxSense, a highly intelligent safety moderator. "
    "You will receive input as a JSON object containing 'ocr_text', 'image_caption', and 'toxicity_scores'. "
    "Think step-by-step. Analyze the contrast between the text and the image. "
    "Classify the input into exactly ONE of these categories: "
    "[safe, racism, sexism, threat, harassment, insult]. "
    "Output JSON ONLY in this format: {\"reasoning\": \"your short analysis\", \"category\": \"<category_name>\"}"
)

messages = [
    {"role": "system", "content": sys_msg}, 
    {"role": "user", "content": json.dumps(input_data, indent=2)}
]

# 2. Generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use the model's own device: device_map="auto" may not place it on "cuda".
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=150)

print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
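
Because the model is prompted to return JSON only, the decoded string can be validated before anything downstream uses it. The following is a minimal sketch under assumptions: `parse_verdict` is a hypothetical helper, and the tolerance for stray text around the JSON object is a defensive choice rather than documented model behavior.

```python
import json

# Category names as listed in the ToxSense system prompt.
CATEGORIES = {"safe", "racism", "sexism", "threat", "harassment", "insult"}


def parse_verdict(raw):
    """Return {'reasoning', 'category'} from model output, or None if invalid."""
    # Take the outermost {...} span in case extra text surrounds the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    category = obj.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        return None
    return {"reasoning": str(obj.get("reasoning", "")), "category": category}
```

A `None` result signals that the output should be retried or escalated instead of being treated as a verdict.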

Model size: 8B params (Safetensors, tensor type F16)