SCYTH-5 Model: Quantized Japanese Topic Classifier with Pentagram Aggregation

License: MIT PyTorch CPU-Only

Model Card

📋 Overview

This is a quantized PyTorch model (scyth_5_cpu_int8.pth) for multi-label topic classification of short Japanese texts. It uses sparse trigram features with pentagram (5-gram) aggregation and is dynamically quantized to INT8 (FBGEMM backend) for fast CPU inference with minimal accuracy loss.

Key Features:

  • Dynamic INT8 quantization (minimal accuracy loss)
  • Trigram + Pentagram (5-gram) aggregation for robustness to data sparsity
  • Memory-efficient (optimized for 12GB+ RAM CPUs)
  • Multi-label classification (11,883 categories)
  • FBGEMM acceleration for faster CPU inference

📊 Training Data & Targets

🗃️ Training Corpus

Wikipedia Dump

  • Source: Japanese Wikipedia (Wikimedia dump)
  • Coverage: 10% (0.1x) of the full Japanese Wikipedia at the time of training.
  • Content: All Wikipedia pages in Japanese, including:
    • Main articles
    • Category pages
    • Textual metadata (excluding images, templates, and non-textual markup)

Data Statistics (Estimated)

| Metric | Value |
| --- | --- |
| Number of pages | ~50,127 |
| Unique trigrams | 2,033,473 |
| Categories | 11,883 |

🏗️ Architecture

Model Components

| Component | Details |
| --- | --- |
| Input | Sparse vector (2,033,473 trigram features) |
| Encoder | Autoencoder (Linear layer) |
| Latent Space | Size: 1024 |
| Bottleneck | Linker size: 512 (reduces dimensionality) |
| Output Head | Multi-label classifier (11,883 categories) |
| Dropout | 0.2 (applied to latent space) |
| Quantization | Dynamic INT8 (FBGEMM backend) |
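As a rough size check, the dimensions in the table imply the weight counts below. This is a back-of-the-envelope sketch that assumes plain dense layers chained as input → latent → linker → output head. The card does not state the exact wiring, so the `layer_shapes` mapping and that ordering are assumptions, and biases are ignored.

```python
# Back-of-the-envelope weight counts from the dimensions in the table above.
# Assumption: dense layers chained input -> latent -> linker -> output head.
INPUT_DIM = 2_033_473
LATENT_SIZE = 1024
LINKER_SIZE = 512
NUM_CATEGORIES = 11_883

layer_shapes = {
    "encoder": (INPUT_DIM, LATENT_SIZE),
    "linker": (LATENT_SIZE, LINKER_SIZE),
    "output_head": (LINKER_SIZE, NUM_CATEGORIES),
}

weight_counts = {name: rows * cols for name, (rows, cols) in layer_shapes.items()}
total_weights = sum(weight_counts.values())

# One byte per weight in INT8, four bytes per weight in FP32 (biases ignored).
int8_gib = total_weights / 2**30
fp32_gib = total_weights * 4 / 2**30

for name, count in weight_counts.items():
    print(f"{name}: {count:,} weights")
print(f"total: {total_weights:,} weights, ~{int8_gib:.2f} GiB INT8, ~{fp32_gib:.2f} GiB FP32")
```

Under these assumptions the input layer dominates with about 2.08 billion weights, which is roughly 2 GB at one byte per weight.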

🧩 Feature Aggregation

The model uses pentagram (5-gram) aggregation for inference to improve coverage and robustness:

  • Each trigram is mapped not only directly but also via associated pentagrams, which act as latent collocational groups.
  • Example: The trigram "源俗説" (ID: 3621) may be expanded via quasi-collocational pentagrams like "た語源俗説" (ID: 1946).

Pentagram Expansion

How It Works

  1. Direct Trigram Matching
    • Input text is split into overlapping trigrams.
    • Each trigram is mapped to its dictionary ID if present ("white" trigrams get full weight).
  2. Pentagram (5-gram) Expansion
    • For each trigram, the model finds associated pentagrams (5-grams) that contain it.
    • The best-matching pentagram is selected based on the frequency of its constituent trigrams in the input.
    • All "white" trigrams from the selected pentagram are added to the feature vector with weighted scores.
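The two steps above can be sketched in a few lines. This is a minimal illustration, not the project's code: `split_trigrams` and `best_pentagram` are hypothetical helper names, and the toy dictionaries are keyed by strings for readability, whereas the checkpoint uses integer IDs.

```python
from collections import Counter

def split_trigrams(text):
    """Return all overlapping 3-character windows of `text`."""
    return [text[i:i + 3] for i in range(max(0, len(text) - 2))]

def best_pentagram(trigram, tri_to_penta, penta_to_tris, counts):
    """Pick the pentagram whose constituent trigrams occur most often in the input."""
    candidates = tri_to_penta.get(trigram, [])
    if not candidates:
        return None
    return max(candidates, key=lambda p: sum(counts[t] for t in penta_to_tris[p]))

# Toy dictionaries (hypothetical contents, string keys instead of integer IDs).
penta_to_tris = {
    "た語源俗説": ["た語源", "語源俗", "源俗説"],
    "語源俗説が": ["語源俗", "源俗説", "俗説が"],
}
tri_to_penta = {"源俗説": ["た語源俗説", "語源俗説が"]}

text = "語源俗説が多い"
counts = Counter(split_trigrams(text))  # missing trigrams count as 0
chosen = best_pentagram("源俗説", tri_to_penta, penta_to_tris, counts)

print(split_trigrams(text))
print(chosen)
```

Here the second pentagram wins because all three of its trigrams occur in the input, while the first has only two matches.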

Example: "源俗説"

| Component | ID | Score | Role |
| --- | --- | --- | --- |
| Direct Trigram | 3621 | 18.0 | "源俗説" (full weight) |
| Pentagram 1 | 1946 | - | "た語源俗説" |
| ├─ た語源 | 380 | 14.3 | (weighted) |
| ├─ 語源俗 | 1098 | 15.7 | (weighted) |
| └─ 源俗説 | 3621 | 18.0 | (redundant, already matched) |
| Pentagram 2 | 1947 | - | "語源俗説が" |
| ├─ 俗説が | 845 | 12.1 | (weighted) |
| ├─ 語源俗 | 1098 | 15.7 | (weighted) |
| └─ 源俗説 | 3621 | 18.0 | (redundant) |

Weight Calculation

Trigram Weight Calculation

The semantic relevance score (score) for trigrams combines:

  1. IDF-based rarity (normalized to [0, 1]).
  2. Character composition weight (predefined per trigram, based on get_weight()):
def get_weight(char):
    if '\u4e00' <= char <= '\u9faf':  # kanji
        return 6
    elif '\u3040' <= char <= '\u309F':  # hiragana
        return 2
    elif '\u30A0' <= char <= '\u30FF':  # katakana
        return 2
    elif 'a' <= char <= 'z' or 'A' <= char <= 'Z':
        return 1
    elif '0' <= char <= '9':
        return 1
    else:
        return 0

Key Definitions

  • weight (from get_weight): A precomputed score for each trigram, determined by its character types (kanji, hiragana, katakana, etc.). Example:
  - `'日本語'` (kanji) → `6+6+6=18`.
  - `'あいう'` (hiragana) → `2+2+2=6`.
  - `'abc'` (Latin) → `1+1+1=3`.
  • IDF (Inverse Document Frequency): Measures how rare a trigram is across all articles (higher = more distinctive). Normalized to [0, 1] via min-max scaling.
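Both definitions can be checked with a short sketch. `trigram_weight` and `min_max_normalize` are hypothetical helper names, the IDF values are made up for illustration, and the compact `get_weight` here is intended to cover the same character classes as the version above.

```python
def get_weight(char):
    # Same character classes as get_weight() above, written compactly.
    if '\u4e00' <= char <= '\u9faf':       # kanji
        return 6
    if '\u3040' <= char <= '\u30ff':       # hiragana and katakana (adjacent ranges)
        return 2
    if char.isascii() and char.isalnum():  # Latin letters and digits
        return 1
    return 0

def trigram_weight(trigram):
    """Sum of the per-character weights of a trigram."""
    return sum(get_weight(c) for c in trigram)

def min_max_normalize(idf_by_trigram):
    """Rescale raw IDF values to [0, 1]."""
    lo, hi = min(idf_by_trigram.values()), max(idf_by_trigram.values())
    span = hi - lo
    # If every IDF is identical, map everything to 0.0 (a choice made for this sketch).
    return {t: (v - lo) / span if span else 0.0 for t, v in idf_by_trigram.items()}

raw_idf = {"日本語": 5.0, "あいう": 1.5, "abc": 3.0}  # made-up IDF values
normalized = min_max_normalize(raw_idf)

for trigram in raw_idf:
    print(trigram, trigram_weight(trigram), round(normalized[trigram], 3))
```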

Score Formula

The combined score for each trigram is calculated as:

score = α * normalized_IDF + (weight * β)

Where:

| Parameter | Value | Purpose |
| --- | --- | --- |
| α | 12 | Scales IDF importance (semantic distinctiveness). |
| β | 1/3 (~0.333) | Scales character-type weight (e.g., kanji trigrams get higher scores). |
| normalized_IDF | [0, 1] | (IDF - IDF_min) / (IDF_max - IDF_min). |
| weight | 3–18 | Sum of get_weight() for each character in the trigram. |

Example Calculation:

| Trigram | Characters | weight | IDF (Normalized) | score = 12*IDF + (weight * 1/3) |
| --- | --- | --- | --- | --- |
| '日本語' | 6+6+6 (kanji) | 18 | 0.9 | 12*0.9 + (18*0.333) ≈ 10.8 + 6 = 16.8 |
| 'あいう' | 2+2+2 (hiragana) | 6 | 0.2 | 12*0.2 + (6*0.333) ≈ 2.4 + 2 = 4.4 |
| 'abc' | 1+1+1 (Latin) | 3 | 0.5 | 12*0.5 + (3*0.333) ≈ 6 + 1 = 7 |
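The example rows can be reproduced with a small function. This is a sketch of the formula as stated, with `trigram_score` as a hypothetical name. The weights and normalized IDF values are taken from the table.

```python
ALPHA = 12.0      # scales the normalized IDF
BETA = 1.0 / 3.0  # scales the character-type weight

def trigram_score(normalized_idf, weight, alpha=ALPHA, beta=BETA):
    """score = alpha * normalized_IDF + weight * beta"""
    return alpha * normalized_idf + weight * beta

# (trigram, weight, normalized IDF) as listed in the example table.
examples = [("日本語", 18, 0.9), ("あいう", 6, 0.2), ("abc", 3, 0.5)]
for trigram, weight, idf in examples:
    print(trigram, round(trigram_score(idf, weight), 1))
```

With normalized_IDF = 1 and weight = 18 the score reaches its maximum of 12 + 6 = 18, which equals the default score_threshold.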

Filtering Trigrams

  • Direct Trigrams (status_dic[tid] == True): Automatically included with full weight (1.0) in the feature matrix. No score calculation needed; treated as high-confidence seeds.

  • Aggregated (Gray) Trigrams: Included only if:

    1. score < score_threshold (default: 18), and
    2. They appear in a pentagram (a 5-character window, i.e., three overlapping trigrams) containing at least one direct (white) trigram. Example: in a window like [white, gray, gray], [gray, white, gray], or [gray, gray, white], all trigrams in the window are included, even though the gray trigrams individually fail the score ≥ 18 threshold.
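One reading of the filtering rule is sketched below. It is an interpretation of the two conditions rather than the project's actual code: `included_trigrams` is a hypothetical helper, and the IDs and scores in the toy dictionaries are invented for illustration.

```python
SCORE_THRESHOLD = 18.0  # default threshold from the text

def included_trigrams(window, status_dic, score_dic, threshold=SCORE_THRESHOLD):
    """Return the trigram IDs from one pentagram window that enter the feature set."""
    has_white = any(status_dic.get(tid) is True for tid in window)
    kept = []
    for tid in window:
        if status_dic.get(tid) is True:
            kept.append(tid)  # direct (white): always kept
        elif has_white and score_dic.get(tid, 0.0) < threshold:
            kept.append(tid)  # gray: kept because the window has a white trigram
    return kept

# Invented IDs and scores: 10 is white, the others are gray.
status_dic = {10: True, 11: False, 12: False, 20: False, 21: False, 22: False}
score_dic = {10: 18.0, 11: 14.3, 12: 15.7, 20: 9.0, 21: 11.5, 22: 13.0}

print(included_trigrams([10, 11, 12], status_dic, score_dic))  # window with a white trigram
print(included_trigrams([20, 21, 22], status_dic, score_dic))  # all-gray window
```

A window without any white trigram contributes nothing, which keeps gray trigrams from entering the feature set on their own.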

📂 Model Checkpoint Structure

The .pth file contains:

{
  "model": <PyTorch quantized model>,
  "metadata": {
    "trigram2col": { <trigram_id>: <column_index> },
    "idx2name": { <category_id>: <category_name> }
  },
  "agrs": {
    "status_dic": { <trigram_id>: <is_white> },
    "pairs_list": [ (<trigram_id>, <pentagram_id>), ... ],
    "ngrm_to_tri": { <pentagram_id>: [<trigram1>, <trigram2>, ...] },
    "score_dic": { <trigram_id>: <score> }
  },
  "config": {
    "input_dim": 2033473,
    "latent_size": 1024,
    "linker_size": 512,
    "num_categories": 11883
  }
}

Key AGR Dictionaries

| Dictionary | Description |
| --- | --- |
| status_dic | Flags trigrams as "white" (valid) or "gray" (aggregated via pentagrams). |
| pairs_list | Maps trigram IDs → pentagram IDs (e.g., (3621, 1946) means "源俗説" belongs to "た語源俗説"). |
| ngrm_to_tri | Maps pentagram IDs → constituent trigrams (e.g., {1946: [380, 1098, 3621]}). |
| score_dic | Corpus-based scores for trigrams (used for weighted aggregation). |
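The inference code further down looks up pentagrams per trigram through a `tri_to_ngrm` mapping that is not listed among the checkpoint keys. One plausible way to derive it is to group `pairs_list` by trigram ID, as sketched here. How `load_dictionaries` actually builds it is not shown in this card, so `build_tri_to_ngrm` is a hypothetical helper and the pairs are example data.

```python
from collections import defaultdict

def build_tri_to_ngrm(pairs_list):
    """Group pentagram IDs by trigram ID, keeping first-seen order without duplicates."""
    tri_to_ngrm = defaultdict(list)
    for trigram_id, pentagram_id in pairs_list:
        if pentagram_id not in tri_to_ngrm[trigram_id]:
            tri_to_ngrm[trigram_id].append(pentagram_id)
    return dict(tri_to_ngrm)

# Example pairs consistent with the IDs used in the pentagram example.
pairs_list = [
    (380, 1946), (1098, 1946), (3621, 1946),
    (1098, 1947), (3621, 1947), (845, 1947),
]
tri_to_ngrm = build_tri_to_ngrm(pairs_list)
print(tri_to_ngrm[3621])
```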

🚀 Inference Example (PyTorch, CPU)

1. Load Model

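import torch

# The checkpoint stores a pickled model object, so full unpickling is required
# (weights_only=False). Only load checkpoints from a source you trust.
checkpoint = torch.load("scyth_5_cpu_int8.pth", map_location="cpu", weights_only=False)
model = checkpoint["model"]
model.eval()
conf = checkpoint["config"]
meta = checkpoint["metadata"]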

2. Generate Embedding (Full Pipeline)

from collections import Counter

import torch

# MODEL_NAME, NEW_VECTOR_NAME, GRAY_WEIGHT, load_quantized_model() and
# load_dictionaries() are not defined in this card; they are assumed to come
# from the accompanying project code.

def generate_embedding(text, model_name=MODEL_NAME, output_embedding_file=NEW_VECTOR_NAME):
    """Return the linker (bottleneck) embedding of `text`, or None for empty input.

    `output_embedding_file` is not used in this excerpt.
    """
    if not text or not text.strip():
        return None

    text = text.lower()
    model, conf, meta, checkpoint = load_quantized_model(model_name)
    (status_dic, _, ngrm_to_tri, _, text_to_id_aux, _, score_dic, tri_to_ngrm) = load_dictionaries(checkpoint)
    trigram2col = meta["trigram2col"]
    input_vec = torch.zeros((1, conf["input_dim"]), dtype=torch.float32)

    # TRIGRAMS: IDs of all overlapping trigrams found in the dictionary, in text order.
    text_trigrams_ids = []
    for i in range(max(0, len(text) - 2)):
        tid = text_to_id_aux.get(text[i:i + 3])
        if tid is not None:
            text_trigrams_ids.append(tid)
    counts_in_text = Counter(text_trigrams_ids)

    # DIRECT: white trigrams present in the text get full weight (1.0).
    seen_white_tris = set()
    for tid in counts_in_text:
        if status_dic.get(tid) is True and tid in trigram2col:
            seen_white_tris.add(tid)
            input_vec[0, trigram2col[tid]] = 1.0

    # AGR: for each distinct input trigram, pick the associated pentagram whose
    # trigrams occur most often in the text, then add its remaining white trigrams.
    def get_ngrm_score(h_id):
        return sum(counts_in_text.get(s_tid, 0) for s_tid in ngrm_to_tri.get(h_id, []))

    for t_id in counts_in_text:  # distinct IDs in first-occurrence order
        h_ids = tri_to_ngrm.get(t_id)
        if not h_ids:
            continue
        best_h_id = max(h_ids, key=get_ngrm_score)
        for s_tri_id in ngrm_to_tri.get(best_h_id, []):
            if (
                status_dic.get(s_tri_id) is True
                and s_tri_id not in seen_white_tris
                and s_tri_id in trigram2col
            ):
                seen_white_tris.add(s_tri_id)
                raw_score = score_dic.get(s_tri_id, 0.0)
                # Scores are divided by 18 and clamped to [0, 1] before scaling.
                normalized_score = max(0.0, min(1.0, raw_score / 18.0))
                input_vec[0, trigram2col[s_tri_id]] = normalized_score * GRAY_WEIGHT

    # MODEL: the third output of the forward pass is the linker embedding.
    with torch.no_grad():
        _, _, linker_embedding = model(input_vec)
    embedding_vec = linker_embedding.squeeze(0).float().cpu().numpy()
    return embedding_vec
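The weight assigned to an aggregated trigram in the AGR step can be isolated as a small function. This sketch mirrors the clamp-and-scale arithmetic in the code above. `gray_feature_weight` is a hypothetical name, and because the value of `GRAY_WEIGHT` is not given in this card, the 0.5 used here is a placeholder.

```python
MAX_SCORE = 18.0  # divisor used in generate_embedding()

def gray_feature_weight(raw_score, gray_weight):
    """Clamp raw_score / 18 to [0, 1], then scale by gray_weight."""
    normalized = max(0.0, min(1.0, raw_score / MAX_SCORE))
    return normalized * gray_weight

GRAY_WEIGHT = 0.5  # hypothetical value, chosen only for this example
for raw in (14.3, 15.7, 18.0, 25.0):
    print(raw, round(gray_feature_weight(raw, GRAY_WEIGHT), 3))
```

Any score at or above 18 maps to the full `GRAY_WEIGHT`, so aggregated features never exceed that value, while direct matches are set to 1.0.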

📦 Requirements

| Dependency | Version |
| --- | --- |
| Python | 3.11.9 |
| torch | 2.5.1 |
| torchaudio | 2.5.1 |
| torchvision | 0.20.1 |
| scipy | 1.13.1 |
| numpy | 1.26.4 |
| psutil | 6.0.0 |
| huggingface-hub | 0.36.0 |
| pandas | 2.2.3 |
| python-dateutil | 2.9.0.post0 |
| pillow | 10.4.0 |
| gradio (optional) | 5.23.0 |

🖥️ Training Hardware

This model was fine-tuned using SCYTH Cyberia, a high-performance computing cluster with the following node specifications:

  • CPU: 14 physical cores (28 logical cores via Hyper-Threading)
  • Memory: 251 GB DDR4 RAM
  • Storage: 915 GB SATA HDDs (RAID-configured for speed)
  • Accelerators: 4 × NVIDIA Tesla K80 GPUs (11.4 GB GDDR5 VRAM each)

Why this setup?

  • Massive parallel processing for large-scale language model training.
  • High memory capacity to handle large batch sizes.

💻 CPU Hardware

  • Minimum: 12GB RAM (16GB+ recommended for batch processing).
  • Quantization: Uses FBGEMM backend for optimized CPU inference.

🔧 Installation

pip install numpy==1.26.4 torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1 psutil==6.0.0 pandas==2.2.3 scipy==1.13.1

🌸 SCYTH-J: Japanese Text Classifier Demo

🏹 GitHub

What it does:

A demo app for classifying Japanese text into user-defined categories using a pre-trained quantized model (published on Hugging Face). Simply input text or upload files, and the app returns the most relevant topics with confidence scores.

How it works:

  1. Input Japanese text → The demo embeds it using the quantized scyth_5_cpu_int8 model.
  2. Compare against custom categories (pre-loaded or added by you).
  3. Rank matches by semantic similarity (cosine similarity).
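The ranking step can be illustrated with plain Python. This is a sketch only: `cosine_similarity` and `rank_categories` are hypothetical helpers, and the 3-dimensional vectors are toy stand-ins for the embeddings returned by `generate_embedding`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def rank_categories(query_vec, category_vecs):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in category_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy category embeddings (invented values).
category_vecs = {
    "history": [1.0, 0.0, 0.0],
    "science": [0.0, 1.0, 0.0],
    "linguistics": [0.6, 0.0, 0.8],
}
query = [0.5, 0.1, 0.9]
ranking = rank_categories(query, category_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Adding a custom category then amounts to embedding its sample text once and storing the vector alongside the others.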

✨ Try it now! Add your own samples to expand the categorization.


📜 License & Ethics

License

MIT License

Usage Guidelines

  • Research/Education: Free to use.
  • Commercial Use: Requires explicit permission from the author.
  • Ethical Use: Must comply with AI ethics standards.
  • Reproducibility: Cite the model appropriately if used in publications.

For technical questions or integration support, contact: 📧 [jhgudleik@gmail.com]


tags: - nlp - japanese - multilabel - pytorch - quantization - cpu - semantic-search - wikipedia - knowledge-graph - embedding

Last Updated: June 2026
