SCYTH-5 Model: Quantized Japanese Topic Classifier with Pentagram Aggregation

📋 Overview

This is a quantized PyTorch model (scyth_5_cpu_int8.pth) for multi-label topic classification of short Japanese texts. It uses sparse trigram features with pentagram (5-gram) aggregation and is dynamically quantized to INT8 (FBGEMM backend) for fast CPU inference with minimal accuracy loss.

Key Features:

Dynamic INT8 quantization (minimal accuracy loss)
Trigram + Pentagram (5-gram) aggregation for robustness to data sparsity
Memory-efficient (runs on CPU machines with 12 GB or more of RAM)
Multi-label classification (11,883 categories)
FBGEMM acceleration for faster CPU inference

📊 Training Data & Targets

🗃️ Training Corpus

Wikipedia Dump

Source: Japanese Wikipedia (Wikimedia dump)
Coverage: 10% (0.1x) of the full Japanese Wikipedia at the time the model was trained.
Content: All Wikipedia pages in Japanese, including:
- Main articles
- Category pages
- Textual metadata (excluding images, templates, and non-textual markup)

Data Statistics (Estimated)

Metric	Value
Number of pages	~50,127
Unique trigrams	2,033,473
Categories	11,883

🏗️ Architecture

Model Components

Component	Details
Input	Sparse vector (2,033,473 trigram features)
Encoder	Autoencoder (Linear layer)
Latent Space	Size: 1024
Bottleneck	Linker size: 512 (reduces dimensionality)
Output Head	Multi-label classifier (11,883 categories)
Dropout	0.2 (applied to latent space)
Quantization	Dynamic INT8 (FBGEMM backend)

🧩 Feature Aggregation

The model uses pentagram (5-gram) aggregation for inference to improve coverage and robustness:

Each trigram is mapped not only directly but also via associated pentagrams, which act as latent collocational groups.
Example: The trigram "源俗説" (ID: 3621) may be expanded via quasi-collocational pentagrams like "た語源俗説" (ID: 1946).

Pentagram Expansion

How It Works

Direct Trigram Matching
- Input text is split into overlapping trigrams.
- Each trigram is mapped to its dictionary ID if present ("white" trigrams get full weight).
Pentagram (5-gram) Expansion
- For each trigram, the model finds associated pentagrams (5-grams) that contain it.
- The best-matching pentagram is selected based on the frequency of its constituent trigrams in the input.
- All "white" trigrams from the selected pentagram are added to the feature vector with weighted scores.
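The two steps above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the helper names `extract_trigrams` and `pick_best_pentagram` are hypothetical, and the toy dictionaries simply reuse the example IDs that appear in this card.

```python
from collections import Counter


def extract_trigrams(text):
    """Return all overlapping 3-character windows of `text`, in order."""
    return [text[i:i + 3] for i in range(max(0, len(text) - 2))]


def pick_best_pentagram(candidates, ngrm_to_tri, counts_in_text):
    """Pick the candidate pentagram whose trigrams occur most often in the input."""
    def coverage(pentagram_id):
        return sum(
            counts_in_text.get(tid, 0)
            for tid in ngrm_to_tri.get(pentagram_id, [])
        )
    return max(candidates, key=coverage)


# Toy dictionaries (illustrative only), using the IDs from the example in this card.
text_to_id = {"た語源": 380, "語源俗": 1098, "源俗説": 3621, "俗説が": 845}
ngrm_to_tri = {1946: [380, 1098, 3621], 1947: [845, 1098, 3621]}

text = "語源俗説が"
trigrams = extract_trigrams(text)
ids = [text_to_id[t] for t in trigrams if t in text_to_id]
counts = Counter(ids)

# Both pentagrams contain the trigram 3621; the one sharing more trigrams
# with the input text is selected.
best = pick_best_pentagram([1946, 1947], ngrm_to_tri, counts)
```

With this input, pentagram 1947 covers three of the text's trigrams and 1946 only two, so 1947 is chosen.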

Example: `"源俗説"`

Component	ID	Score	Role
Direct Trigram	3621	18.0	`"源俗説"` (full weight)
Pentagram 1	1946	-	`"た語源俗説"`
├─ `た語源`	380	14.3	(weighted)
├─ `語源俗`	1098	15.7	(weighted)
└─ `源俗説`	3621	18.0	(redundant, already matched)
Pentagram 2	1947	-	`"語源俗説が"`
├─ `俗説が`	845	12.1	(weighted)
├─ `語源俗`	1098	15.7	(weighted)
└─ `源俗説`	3621	18.0	(redundant)

Weight Calculation

Trigram Weight Calculation

The semantic relevance score (score) for trigrams combines:

IDF-based rarity (normalized to [0, 1]).
Character composition weight (predefined per trigram, based on get_weight()):

def get_weight(char):
    if '\u4e00' <= char <= '\u9faf':  # kanji
        return 6
    elif '\u3040' <= char <= '\u309F':  # hiragana
        return 2
    elif '\u30A0' <= char <= '\u30FF':  # katakana
        return 2
    elif 'a' <= char <= 'z' or 'A' <= char <= 'Z':
        return 1
    elif '0' <= char <= '9':
        return 1
    else:
        return 0

Key Definitions

weight (from get_weight): A precomputed score for each trigram, determined by its character types (kanji, hiragana, katakana, etc.). Example:

  - `'日本語'` (kanji) → `6+6+6=18`.
  - `'あいう'` (hiragana) → `2+2+2=6`.
  - `'abc'` (Latin) → `1+1+1=3`.

IDF (Inverse Document Frequency): Measures how rare a trigram is across all articles (higher = more distinctive). Normalized to [0, 1] via min-max scaling.
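As a sketch of the two definitions above, the per-trigram weight is the sum of `get_weight()` over its characters, and IDF values are rescaled with min-max normalization. The helper names `trigram_weight` and `normalize_idf` are hypothetical, and returning 0.0 when all IDF values are equal is an assumption made here to avoid division by zero.

```python
def get_weight(char):
    """Per-character weight, mirroring the get_weight() shown in this card."""
    if '\u4e00' <= char <= '\u9faf':      # kanji
        return 6
    if '\u3040' <= char <= '\u309f':      # hiragana
        return 2
    if '\u30a0' <= char <= '\u30ff':      # katakana
        return 2
    if 'a' <= char <= 'z' or 'A' <= char <= 'Z' or '0' <= char <= '9':
        return 1
    return 0


def trigram_weight(trigram):
    """Sum the character weights of a trigram."""
    return sum(get_weight(c) for c in trigram)


def normalize_idf(idf_by_trigram):
    """Min-max scale raw IDF values to [0, 1]."""
    if not idf_by_trigram:
        return {}
    lo = min(idf_by_trigram.values())
    hi = max(idf_by_trigram.values())
    if hi == lo:
        # Assumption: a degenerate range maps every trigram to 0.0.
        return {tri: 0.0 for tri in idf_by_trigram}
    return {tri: (v - lo) / (hi - lo) for tri, v in idf_by_trigram.items()}
```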

Score Formula

The combined score for each trigram is calculated as:

score = α * normalized_IDF + (weight * β)

Where:

Parameter	Value	Purpose
`α = 12`	12	Scales IDF importance (semantic distinctiveness).
`β = 1/3`	~0.333	Scales character-type weight (e.g., kanji trigrams get higher scores).
`normalized_IDF`	`[0, 1]`	`(IDF – IDF_min) / (IDF_max – IDF_min)`.
`weight`	`3–18`	Sum of `get_weight()` for each character in the trigram.

Example Calculation:

Trigram	Characters	`weight`	`IDF` (normalized)	`score = 12 × IDF + (weight × 1/3)`
`'日本語'`	6+6+6 (kanji)	18	0.9	`12 × 0.9 + (18 × 0.333) ≈ 10.8 + 6 = 16.8`
`'あいう'`	2+2+2 (hiragana)	6	0.2	`12 × 0.2 + (6 × 0.333) ≈ 2.4 + 2 = 4.4`
`'abc'`	1+1+1 (Latin)	3	0.5	`12 × 0.5 + (3 × 0.333) ≈ 6 + 1 = 7`
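The score formula can be checked with a few lines of Python. The function name `combined_score` is hypothetical; the constants are the α and β values given in the parameter table.

```python
ALPHA = 12.0        # scales normalized IDF
BETA = 1.0 / 3.0    # scales the character-type weight


def combined_score(normalized_idf, weight, alpha=ALPHA, beta=BETA):
    """score = alpha * normalized_IDF + weight * beta"""
    return alpha * normalized_idf + weight * beta


# The three rows of the example table.
examples = {
    "日本語": combined_score(0.9, 18),
    "あいう": combined_score(0.2, 6),
    "abc": combined_score(0.5, 3),
}

# Upper bound: normalized IDF of 1.0 and an all-kanji trigram (weight 18).
max_score = combined_score(1.0, 18)
```

With these parameters the score cannot exceed 12 + 6 = 18, which matches both the default `score_threshold` and the divisor `18.0` used in the inference code.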

Filtering Trigrams

Direct (white) trigrams (status_dic[tid] is True): Always included in the feature matrix with full weight (1.0). No score calculation is needed; they are treated as high-confidence seeds.
Aggregated (gray) trigrams: Included only if both conditions hold:
1. score < score_threshold (default: 18), and
2. They appear in a pentagram (a 5-character window, i.e. three overlapping trigrams) that contains at least one direct (white) trigram. Example: in a window such as [white, gray, gray], [gray, white, gray], or [gray, gray, white], all trigrams in the window are included, even though the gray trigrams individually fall below the score threshold of 18.
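One reading of the filtering rule above is sketched below. It follows the wording of the list, not necessarily the shipped implementation; the function name `select_from_window` and the white/gray assignments in the toy data are made up for illustration.

```python
def select_from_window(window, status_dic, score_dic, score_threshold=18.0):
    """Return the trigram IDs from one pentagram window that pass the filter."""
    has_white = any(status_dic.get(tid) is True for tid in window)
    selected = []
    for tid in window:
        if status_dic.get(tid) is True:
            # White trigram: always included.
            selected.append(tid)
        elif has_white and score_dic.get(tid, 0.0) < score_threshold:
            # Gray trigram: included only because the window has a white trigram.
            selected.append(tid)
    return selected


# Illustrative data: one white trigram (3621) and two gray ones.
score_dic = {380: 14.3, 1098: 15.7, 3621: 18.0}
with_white = select_from_window(
    [380, 1098, 3621], {380: False, 1098: False, 3621: True}, score_dic
)
without_white = select_from_window(
    [380, 1098, 3621], {380: False, 1098: False, 3621: False}, score_dic
)
```

A window with no white trigram contributes nothing, while a single white trigram brings its gray neighbours along.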

📂 Model Checkpoint Structure

The .pth file contains:

{
  "model": <PyTorch quantized model>,
  "metadata": {
    "trigram2col": { <trigram_id>: <column_index> },
    "idx2name": { <category_id>: <category_name> }
  },
  "agrs": {
    "status_dic": { <trigram_id>: <is_white> },
    "pairs_list": [ (<trigram_id>, <pentagram_id>), ... ],
    "ngrm_to_tri": { <pentagram_id>: [<trigram1>, <trigram2>, ...] },
    "score_dic": { <trigram_id>: <score> }
  },
  "config": {
    "input_dim": 2033473,
    "latent_size": 1024,
    "linker_size": 512,
    "num_categories": 11883
  }
}

Key AGR Dictionaries

Dictionary	Description
`status_dic`	Flags trigrams as `"white"` (valid) or `"gray"` (aggregated via pentagrams).
`pairs_list`	Maps trigram IDs → pentagram IDs (e.g., `(3621, 1946)` means `"源俗説"` belongs to `"た語源俗説"`).
`ngrm_to_tri`	Maps pentagram IDs → constituent trigrams (e.g., `{1946: [380, 1098, 3621]}`).
`score_dic`	Corpus-based scores for trigrams (used for weighted aggregation).
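The inference code looks pentagrams up by trigram ID through a `tri_to_ngrm` mapping. How `load_dictionaries()` builds it is not shown in this card; one plausible way, assuming it is derived from `pairs_list`, is to invert the pairs as below. The function name `build_tri_to_ngrm` is hypothetical.

```python
from collections import defaultdict


def build_tri_to_ngrm(pairs_list):
    """Group pentagram IDs by trigram ID, keeping first-seen order without duplicates."""
    tri_to_ngrm = defaultdict(list)
    for trigram_id, pentagram_id in pairs_list:
        if pentagram_id not in tri_to_ngrm[trigram_id]:
            tri_to_ngrm[trigram_id].append(pentagram_id)
    return dict(tri_to_ngrm)


# Illustrative pairs using the example IDs from this card.
pairs_list = [(3621, 1946), (3621, 1947), (380, 1946), (3621, 1946)]
tri_to_ngrm = build_tri_to_ngrm(pairs_list)
```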

🚀 Inference Example (PyTorch, CPU)

1. Load Model

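import torch

# The checkpoint stores a full pickled model object, so it must be loaded with
# weights_only=False. Only load checkpoints from a source you trust.
checkpoint = torch.load("scyth_5_cpu_int8.pth", map_location="cpu", weights_only=False)
model = checkpoint["model"]
model.eval()  # disable dropout for inference
conf = checkpoint["config"]    # dimensions such as input_dim
meta = checkpoint["metadata"]  # trigram2col and idx2name mappings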

2. Generate Embedding (Full Pipeline)

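from collections import Counter

import torch

# MODEL_NAME, NEW_VECTOR_NAME, GRAY_WEIGHT, load_quantized_model() and
# load_dictionaries() must be defined before this function; they are not
# shown in this card.
def generate_embedding(
    text, model_name=MODEL_NAME, output_embedding_file=NEW_VECTOR_NAME):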

    if not text or not text.strip():
        return None

    text = text.lower()
    model, conf, meta, checkpoint = load_quantized_model(model_name)
    (
        status_dic,
        _,
        ngrm_to_tri,
        _,
        text_to_id_aux,
        _,
        score_dic,
        tri_to_ngrm
    ) = load_dictionaries(checkpoint)
    input_vec = torch.zeros(
        (1, conf["input_dim"]),
        dtype=torch.float32
    )
    total_trigrams_in_text = max(
        0,
        len(text) - 2
    )
    seen_white_tris = set()
    text_trigrams_ids = []
    # =====================================================
    # TRIGRAMS
    # =====================================================
    for i in range(total_trigrams_in_text):
        tid = text_to_id_aux.get(
            text[i:i+3]
        )
        if tid is not None:
            text_trigrams_ids.append(tid)

    counts_in_text = Counter(
        text_trigrams_ids
    )
    # =====================================================
    # DIRECT
    # =====================================================
    for tid in set(text_trigrams_ids):
        if status_dic.get(tid) is True:
            if tid not in seen_white_tris:
                if tid in meta["trigram2col"]:
                    seen_white_tris.add(tid)
                    col = meta["trigram2col"][tid]
                    input_vec[0, col] = 1.0
    # =====================================================
    # AGR
    # =====================================================
    processed_input_tids = set()
    for i in range(total_trigrams_in_text):
        tri_str = text[i:i+3]
        t_id = text_to_id_aux.get(tri_str)

        if (t_id is not None and t_id in tri_to_ngrm and t_id not in processed_input_tids):

            processed_input_tids.add(t_id)
            h_ids = tri_to_ngrm[t_id]

            if not h_ids:
                continue

            def get_ngrm_score(h_idx):
                tris_in_ngrm = ngrm_to_tri.get(
                    h_idx,
                    []
                )
                return sum(
                    counts_in_text.get(s_tid, 0)
                    for s_tid in tris_in_ngrm
                )
            best_h_id = max(
                h_ids,
                key=get_ngrm_score
            )
            for s_tri_id in ngrm_to_tri.get(
                best_h_id,
                []
            ):
                if (
                    status_dic.get(s_tri_id) is True
                    and s_tri_id not in seen_white_tris
                ):
                    if s_tri_id in meta["trigram2col"]:
                        seen_white_tris.add(s_tri_id)
                        col = meta["trigram2col"][s_tri_id]
                        raw_score = score_dic.get(
                            s_tri_id,
                            0.0
                        )
                        normalized_score = max(
                            0.0,
                            min(
                                1.0,
                                raw_score / 18.0
                            )
                        )
                        final_weight = (
                            normalized_score
                            * GRAY_WEIGHT
                        )
                        input_vec[0, col] = final_weight
    # =====================================================
    # MODEL
    # =====================================================
    with torch.no_grad():
        _, _, linker_embedding = model(input_vec)
        embedding_vec = (
            linker_embedding
            .squeeze(0)
            .float()
            .cpu()
            .numpy()
        )
    return embedding_vec
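The weighting applied to pentagram-expanded trigrams in the AGR section above can be isolated as a small helper. This is a sketch: the name `gray_feature_weight` is hypothetical, and because the value of `GRAY_WEIGHT` is not given in this card, the 0.5 used below is only a placeholder.

```python
def gray_feature_weight(raw_score, gray_weight, max_score=18.0):
    """Clamp raw_score / max_score to [0, 1], then scale by gray_weight."""
    normalized = max(0.0, min(1.0, raw_score / max_score))
    return normalized * gray_weight


# Placeholder value; the real GRAY_WEIGHT is defined outside this card.
PLACEHOLDER_GRAY_WEIGHT = 0.5

half = gray_feature_weight(9.0, PLACEHOLDER_GRAY_WEIGHT)      # mid-range score
capped = gray_feature_weight(36.0, PLACEHOLDER_GRAY_WEIGHT)   # clamped to 1.0 first
floored = gray_feature_weight(-1.0, PLACEHOLDER_GRAY_WEIGHT)  # clamped to 0.0 first
```

Directly matched trigrams keep the value 1.0, so any `GRAY_WEIGHT` below 1 makes expanded trigrams contribute less than direct matches.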

📦 Requirements

Dependency	Version
Python	3.11.9
`torch`	2.5.1
`torchaudio`	2.5.1
`torchvision`	0.20.1
`scipy`	1.13.1
`numpy`	1.26.4
`psutil`	6.0.0
`gradio` (optional, demo only)	5.23.0
`huggingface-hub`	0.36.0
`pandas`	2.2.3
`python-dateutil`	2.9.0.post0
`pillow`	10.4.0

🖥️ Training Hardware

This model was fine-tuned using SCYTH Cyberia, a high-performance computing cluster with the following node specifications:

CPU: 28 logical cores (14 physical cores with Hyper-Threading)
Memory: 251 GB DDR4 RAM
Storage: 915 GB SATA HDDs (RAID-configured for speed)
Accelerators: 4 × NVIDIA Tesla K80 GPUs (11.4 GB GDDR5 VRAM each)

Why this setup?

Massive parallel processing for large-scale language model training.
High memory capacity to handle large batch sizes.

💻 CPU Hardware

Minimum: 12GB RAM (16GB+ recommended for batch processing).
Quantization: Uses FBGEMM backend for optimized CPU inference.

🔧 Installation

pip install numpy==1.26.4 torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1 psutil==6.0.0 pandas==2.2.3 scipy==1.13.1

🌸 SCYTH-J: Japanese Text Classifier Demo

🏹 GitHub

What it does:

A demo app for classifying Japanese text into user-defined categories using a pre-trained quantized model (published on Hugging Face). Simply input text or upload files, and the app returns the most relevant topics with confidence scores.

How it works:

Input Japanese text → The demo embeds it using the quantized scyth_5_cpu_int8 model.
Compare against custom categories (pre-loaded or added by you).
Rank matches by semantic similarity (cosine similarity between embeddings).
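The ranking step can be sketched in plain Python. This is not the demo's actual code: the function names and the two-dimensional toy vectors are hypothetical, and in the demo the vectors would be the embeddings produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(query_vec, category_vectors):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in category_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy category embeddings (illustrative only).
categories = {
    "history": [1.0, 0.0],
    "science": [0.0, 1.0],
    "mixed": [1.0, 1.0],
}
ranked = rank_categories([1.0, 0.1], categories)
ranked_names = [name for name, _ in ranked]
```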

✨ Try it now! Add your own samples to expand the categorization.

📜 License & Ethics

License

MIT License

Usage Guidelines

Research/Education: Free to use.
Commercial Use: Requires explicit permission from the author.
Ethical Use: Must comply with AI ethics standards.
Reproducibility: Cite the model appropriately if used in publications.

For technical questions or integration support, contact: 📧 [jhgudleik@gmail.com]

📚 References

Dynamic Quantization: PyTorch Quantization Docs
FBGEMM: Facebook Research's Gemm Library

Tags: nlp, japanese, multilabel, pytorch, quantization, cpu, semantic-search, wikipedia, knowledge-graph, embedding

Last Updated: June 2026
