---
language: en
license: mit
tags:
  - moderation
  - safety
  - content-moderation
  - transformer
  - chain-of-thought
  - reasoning
library_name: pytorch
pipeline_tag: text-generation
datasets:
  - OnlyCheeini/greesyguard-3-mini-claude-4.6-sonnet-2000x
---

# GreesyGuard (GreesyGPT)

GreesyGuard is a lightweight reasoning-based content moderation model designed to analyze user messages, evaluate harm potential, and produce structured moderation verdicts.

Unlike traditional classifiers, GreesyGuard performs step-by-step analysis inside `<think>` blocks before generating the final moderation decision.

This improves transparency and makes moderation decisions easier to audit.


## Model Overview

GreesyGuard is a Transformer model specialized for safety classification tasks such as:

- harassment detection
- hate speech detection
- spam detection
- misinformation identification
- crisis detection

Instead of directly outputting a label, the model:

  1. Analyzes the message
  2. Evaluates context and intent
  3. Identifies policy violations
  4. Outputs a final moderation verdict
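Because the reasoning and the verdict arrive in one generated string, a consumer has to separate them. The steps above could be post-processed with a helper like the following sketch; `split_verdict` is a hypothetical name, not part of the GreesyGuard API.

```python
import re

def split_verdict(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, verdict).

    Hypothetical helper; GreesyGuard's own tooling may differ.
    """
    # Extract the reasoning inside the <think> block, if present.
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    # Whatever remains outside the block is the verdict section.
    verdict = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return reasoning, verdict

reasoning, verdict = split_verdict(
    "<think>The message is a direct personal attack.</think>\n\n## Verdict\n**HARASSMENT**"
)
print(verdict)
```

Keeping the two parts separate lets a pipeline log the reasoning for audits while acting only on the verdict.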

## Moderation Labels

The model produces the following moderation categories:

- `SAFE`
- `SPAM`
- `MISINFORMATION`
- `HARASSMENT`
- `HATE_SPEECH`
- `CRISIS_REFERRAL`
- `UNSAFE`

Example output:

```markdown
## Verdict
**HARASSMENT**
```

## Model Architecture

| Parameter | Value |
| --- | --- |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context window | 12,000 tokens |
| Tokenizer | o200k_base (extended) |
| Vocabulary size | 8,192 |
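The table above can be collected into a single configuration object. The sketch below is illustrative only; the field names (`n_layers`, `d_model`, and so on) are assumptions, not GreesyGuard's actual config class.

```python
from dataclasses import dataclass

@dataclass
class GreesyGuardConfig:
    """Hypothetical config mirroring the architecture table; field names are assumed."""
    n_layers: int = 12
    n_heads: int = 12
    d_model: int = 768            # embedding dimension
    context_window: int = 12_000  # tokens
    vocab_size: int = 8_192
    tokenizer: str = "o200k_base"

cfg = GreesyGuardConfig()
# The per-head dimension follows from d_model / n_heads.
assert cfg.d_model % cfg.n_heads == 0
print(cfg.d_model // cfg.n_heads)  # → 64
```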

Key architectural features:

- Transformer decoder architecture
- Rotary Positional Embeddings (RoPE)
- KV-cache optimized inference
- Structured chat-template training
- Markdown reasoning output
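To make the RoPE bullet concrete: rotary embeddings encode position by rotating each consecutive pair of features in a query or key vector by a position-dependent angle. A minimal, dependency-free sketch of that rotation (not GreesyGuard's actual implementation):

```python
import math

def rope_rotate(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Apply a rotary positional embedding to one head vector (illustrative)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # Lower pairs rotate fast, higher pairs slowly, like sinusoidal embeddings.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        # 2-D rotation of the (x[i], x[i+1]) pair.
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

# At position 0 the rotation is the identity.
print(rope_rotate([1.0, 0.0, 0.5, 0.5], pos=0))  # → [1.0, 0.0, 0.5, 0.5]
```

Because rotations preserve vector norms, attention scores depend only on relative positions, which is what makes RoPE attractive for long contexts like this model's 12,000-token window.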

## Reasoning Modes

The model supports configurable reasoning budgets:

| Mode | Think Tokens | Purpose |
| --- | --- | --- |
| NONE | 200 | Fast moderation |
| LOW | 512 | Balanced reasoning |
| MEDIUM | 1536 | Detailed analysis |
| HIGH | 3072 | Maximum review depth |

Higher modes produce more thorough moderation reasoning but increase latency.
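The usage example below imports a `ReasoningMode` from `model`; one plausible shape for it is an enum that maps each mode to its think-token budget from the table above. This is a sketch of that idea, not the model's actual source.

```python
from enum import Enum

class ReasoningMode(Enum):
    """Think-token budgets from the Reasoning Modes table (values assumed to be the budget)."""
    NONE = 200
    LOW = 512
    MEDIUM = 1536
    HIGH = 3072

def think_budget(mode: ReasoningMode) -> int:
    # The budget caps how many tokens the <think> block may consume.
    return mode.value

print(think_budget(ReasoningMode.MEDIUM))  # → 1536
```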


## Example Usage

```python
from model import GreesyGPT, generate_moderation, ReasoningMode, OutputFormat

model = GreesyGPT()

result = generate_moderation(
    model,
    prompt="You're worthless and nobody likes you.",
    mode=ReasoningMode.MEDIUM,
    output_format=OutputFormat.JSON,
)

print(result["verdict_fmt"])
```

Example structured output:

```json
{
  "verdict": "HARASSMENT",
  "severity": 3,
  "confidence_hint": "medium"
}
```
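A moderation pipeline would typically map that JSON onto an action. The routing table and thresholds below are illustrative assumptions, not part of GreesyGuard; in particular, escalating high-severity verdicts to a human reflects the guidance in the Safety section.

```python
import json

# Hypothetical verdict → action mapping; adjust to your own policy.
ACTIONS = {
    "SAFE": "allow",
    "SPAM": "filter",
    "MISINFORMATION": "flag_for_review",
    "HARASSMENT": "flag_for_review",
    "HATE_SPEECH": "flag_for_review",
    "CRISIS_REFERRAL": "escalate_to_human",
    "UNSAFE": "flag_for_review",
}

def route(verdict_json: str) -> str:
    v = json.loads(verdict_json)
    action = ACTIONS.get(v["verdict"], "flag_for_review")
    # High-severity verdicts always get a human in the loop.
    if v.get("severity", 0) >= 3:
        action = "escalate_to_human"
    return action

print(route('{"verdict": "HARASSMENT", "severity": 3, "confidence_hint": "medium"}'))
# → escalate_to_human
```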

## Training Format

Training data follows a structured conversation template:

```text
<|system|>
moderation instructions
</|system|>

<|user|>
message to review
</|user|>

<|assistant|>
<think>
step-by-step reasoning
</think>

verdict<|endoftext|>
```

Only assistant tokens contribute to the training loss.
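Restricting the loss to assistant tokens is commonly implemented by copying the token ids as labels and overwriting every non-assistant position with an ignore index (`-100` is the convention used by PyTorch's cross-entropy loss). A sketch of that masking; GreesyGuard's actual training code may differ.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_labels(token_ids: list[int], is_assistant: list[bool]) -> list[int]:
    """Copy token ids as labels, masking every non-assistant position."""
    return [tid if asst else IGNORE_INDEX
            for tid, asst in zip(token_ids, is_assistant)]

# System/user tokens (False) are ignored; assistant tokens (True) are trained on.
print(mask_labels([11, 12, 13, 14], [False, False, True, True]))
# → [-100, -100, 13, 14]
```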


## Intended Use

GreesyGuard is designed for:

- social media moderation
- comment filtering
- forum safety pipelines
- research in explainable moderation systems

## Limitations

- The reasoning output may appear confident but still be incorrect.
- Sarcasm and cultural context can be misinterpreted.
- The model should not be used for fully automated enforcement without human oversight.

## Safety

Moderation systems should always include human review for high‑impact actions such as account suspension or legal escalation.


## Authors

Created by the GreesyGuard Project.

Author: Nicat

GitHub: https://github.com/Nicat-dcw/GreesyGuard