---
language: en
license: mit
tags:
  - moderation
  - safety
  - content-moderation
  - transformer
  - chain-of-thought
  - reasoning
library_name: pytorch
pipeline_tag: text-generation
datasets:
  - OnlyCheeini/greesyguard-3-mini-claude-4.6-sonnet-2000x
---

# GreesyGuard (GreesyGPT)

GreesyGuard is a lightweight reasoning-based content moderation model designed to analyze user messages, evaluate harm potential, and produce structured moderation verdicts.

Unlike traditional classifiers, GreesyGuard performs step-by-step analysis inside `<think>` blocks before generating the final moderation decision.

This improves transparency and makes moderation decisions easier to audit.


## Model Overview

GreesyGuard is a Transformer model specialized for safety classification tasks such as:

- harassment detection
- hate speech detection
- spam detection
- misinformation identification
- crisis detection

Instead of directly outputting a label, the model:

  1. Analyzes the message
  2. Evaluates context and intent
  3. Identifies policy violations
  4. Outputs a final moderation verdict
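Because the reasoning and the verdict arrive in one generated string, a consumer has to separate them. The steps above could be post-processed with a helper like the following sketch; `split_verdict` is a hypothetical name, not part of the GreesyGuard API.

```python
import re

def split_verdict(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, verdict).

    Hypothetical helper; GreesyGuard's own tooling may differ.
    """
    # Extract the reasoning inside the <think> block, if present.
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    # Whatever remains outside the block is the verdict section.
    verdict = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return reasoning, verdict

reasoning, verdict = split_verdict(
    "<think>The message is a direct personal attack.</think>\n\n## Verdict\n**HARASSMENT**"
)
print(verdict)
```

Keeping the two parts separate lets a pipeline log the reasoning for audits while acting only on the verdict.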

## Moderation Labels

The model produces the following moderation categories:

- `SAFE`
- `SPAM`
- `MISINFORMATION`
- `HARASSMENT`
- `HATE_SPEECH`
- `CRISIS_REFERRAL`
- `UNSAFE`

Example output:

```markdown
## Verdict
**HARASSMENT**
```

## Model Architecture

| Parameter | Value |
| --- | --- |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context window | 12,000 tokens |
| Tokenizer | o200k_base (extended) |
| Vocabulary size | 8,192 |
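The table above can be collected into a single configuration object. The sketch below is illustrative only; the field names (`n_layers`, `d_model`, and so on) are assumptions, not GreesyGuard's actual config class.

```python
from dataclasses import dataclass

@dataclass
class GreesyGuardConfig:
    """Hypothetical config mirroring the architecture table; field names are assumed."""
    n_layers: int = 12
    n_heads: int = 12
    d_model: int = 768            # embedding dimension
    context_window: int = 12_000  # tokens
    vocab_size: int = 8_192
    tokenizer: str = "o200k_base"

cfg = GreesyGuardConfig()
# The per-head dimension follows from d_model / n_heads.
assert cfg.d_model % cfg.n_heads == 0
print(cfg.d_model // cfg.n_heads)  # → 64
```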

Key architectural features:

- Transformer decoder architecture
- Rotary Positional Embeddings (RoPE)
- KV-cache optimized inference
- Structured chat-template training
- Markdown reasoning output
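To make the RoPE bullet concrete: rotary embeddings encode position by rotating each consecutive pair of features in a query or key vector by a position-dependent angle. A minimal, dependency-free sketch of that rotation (not GreesyGuard's actual implementation):

```python
import math

def rope_rotate(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Apply a rotary positional embedding to one head vector (illustrative)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # Lower pairs rotate fast, higher pairs slowly, like sinusoidal embeddings.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        # 2-D rotation of the (x[i], x[i+1]) pair.
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

# At position 0 the rotation is the identity.
print(rope_rotate([1.0, 0.0, 0.5, 0.5], pos=0))  # → [1.0, 0.0, 0.5, 0.5]
```

Because rotations preserve vector norms, attention scores depend only on relative positions, which is what makes RoPE attractive for long contexts like this model's 12,000-token window.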

## Reasoning Modes

The model supports configurable reasoning budgets:

| Mode | Think Tokens | Purpose |
| --- | --- | --- |
| NONE | 200 | Fast moderation |
| LOW | 512 | Balanced reasoning |
| MEDIUM | 1536 | Detailed analysis |
| HIGH | 3072 | Maximum review depth |

Higher modes produce more thorough moderation reasoning but increase latency.
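The usage example below imports a `ReasoningMode` from `model`; one plausible shape for it is an enum that maps each mode to its think-token budget from the table above. This is a sketch of that idea, not the model's actual source.

```python
from enum import Enum

class ReasoningMode(Enum):
    """Think-token budgets from the Reasoning Modes table (values assumed to be the budget)."""
    NONE = 200
    LOW = 512
    MEDIUM = 1536
    HIGH = 3072

def think_budget(mode: ReasoningMode) -> int:
    # The budget caps how many tokens the <think> block may consume.
    return mode.value

print(think_budget(ReasoningMode.MEDIUM))  # → 1536
```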


## Example Usage

```python
from model import GreesyGPT, generate_moderation, ReasoningMode, OutputFormat

model = GreesyGPT()

result = generate_moderation(
    model,
    prompt="You're worthless and nobody likes you.",
    mode=ReasoningMode.MEDIUM,
    output_format=OutputFormat.JSON,
)

print(result["verdict_fmt"])
```

Example structured output:

```json
{
  "verdict": "HARASSMENT",
  "severity": 3,
  "confidence_hint": "medium"
}
```
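A moderation pipeline would typically map that JSON onto an action. The routing table and thresholds below are illustrative assumptions, not part of GreesyGuard; in particular, escalating high-severity verdicts to a human reflects the guidance in the Safety section.

```python
import json

# Hypothetical verdict → action mapping; adjust to your own policy.
ACTIONS = {
    "SAFE": "allow",
    "SPAM": "filter",
    "MISINFORMATION": "flag_for_review",
    "HARASSMENT": "flag_for_review",
    "HATE_SPEECH": "flag_for_review",
    "CRISIS_REFERRAL": "escalate_to_human",
    "UNSAFE": "flag_for_review",
}

def route(verdict_json: str) -> str:
    v = json.loads(verdict_json)
    action = ACTIONS.get(v["verdict"], "flag_for_review")
    # High-severity verdicts always get a human in the loop.
    if v.get("severity", 0) >= 3:
        action = "escalate_to_human"
    return action

print(route('{"verdict": "HARASSMENT", "severity": 3, "confidence_hint": "medium"}'))
# → escalate_to_human
```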

## Training Format

Training data follows a structured conversation template:

```text
<|system|>
moderation instructions
</|system|>

<|user|>
message to review
</|user|>

<|assistant|>
<think>
step-by-step reasoning
</think>

verdict<|endoftext|>
```

Only assistant tokens contribute to the training loss.
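Restricting the loss to assistant tokens is commonly implemented by copying the token ids as labels and overwriting every non-assistant position with an ignore index (`-100` is the convention used by PyTorch's cross-entropy loss). A sketch of that masking; GreesyGuard's actual training code may differ.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_labels(token_ids: list[int], is_assistant: list[bool]) -> list[int]:
    """Copy token ids as labels, masking every non-assistant position."""
    return [tid if asst else IGNORE_INDEX
            for tid, asst in zip(token_ids, is_assistant)]

# System/user tokens (False) are ignored; assistant tokens (True) are trained on.
print(mask_labels([11, 12, 13, 14], [False, False, True, True]))
# → [-100, -100, 13, 14]
```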


## Intended Use

GreesyGuard is designed for:

- social media moderation
- comment filtering
- forum safety pipelines
- research in explainable moderation systems

## Limitations

- The reasoning output may appear confident but still be incorrect.
- Sarcasm and cultural context can be misinterpreted.
- The model should not be used for fully automated enforcement without human oversight.

## Safety

Moderation systems should always include human review for high‑impact actions such as account suspension or legal escalation.


## Authors

Created by the GreesyGuard Project.

Author: Nicat

GitHub: https://github.com/Nicat-dcw/GreesyGuard