SCYTH-5 Model: Quantized Japanese Topic Classifier with Pentagram Aggregation
Overview
This is a quantized PyTorch model (scyth_5_cpu_int8.pth) designed for multi-label topic classification of short Japanese texts. It uses sparse trigram features with pentagram (5-gram) aggregation and is dynamically quantized to INT8 (FBGEMM backend) for fast CPU inference with minimal accuracy loss.
Key Features:
- Dynamic INT8 quantization (minimal accuracy loss)
- Trigram + Pentagram (5-gram) aggregation for robustness to data sparsity
- Memory-efficient (optimized for 12GB+ RAM CPUs)
- Multi-label classification (11,883 categories)
- FBGEMM acceleration for faster CPU inference
Training Data & Targets
Training Corpus
Wikipedia Dump
- Source: Japanese Wikipedia (Wikimedia dump)
- Coverage: 10% (0.1x) of the full Japanese Wikipedia at the time of training.
- Content: All Wikipedia pages in Japanese, including:
  - Main articles
  - Category pages
  - Textual metadata (excluding images, templates, and non-textual markup)
Data Statistics (Estimated)
| Metric | Value |
|---|---|
| Number of pages | ~50,127 |
| Unique trigrams | 2,033,473 |
| Categories | 11,883 |
Architecture
Model Components
| Component | Details |
|---|---|
| Input | Sparse vector (2,033,473 trigram features) |
| Encoder | Autoencoder (Linear layer) |
| Latent Space | Size: 1024 |
| Bottleneck | Linker size: 512 (reduces dimensionality) |
| Output Head | Multi-label classifier (11,883 categories) |
| Dropout | 0.2 (applied to latent space) |
| Quantization | Dynamic INT8 (FBGEMM backend) |
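As a rough back-of-the-envelope check on why INT8 matters here, the sketch below counts the weights of the first encoder layer alone. It assumes that layer is a single dense linear map from the 2,033,473 input features to the 1024-dimensional latent space, as the table suggests. The helper name `linear_weight_bytes` is made up for this illustration, and biases, the remaining layers, and runtime overhead are ignored.

```python
INPUT_DIM = 2_033_473   # trigram features (from the table above)
LATENT_SIZE = 1024      # latent space size (from the table above)
GIB = 1024 ** 3


def linear_weight_bytes(in_features, out_features, bytes_per_weight):
    """Storage for the weight matrix of one dense layer (bias ignored)."""
    return in_features * out_features * bytes_per_weight


# Assumption: the encoder is one dense layer of shape (INPUT_DIM, LATENT_SIZE).
int8_bytes = linear_weight_bytes(INPUT_DIM, LATENT_SIZE, 1)  # INT8: 1 byte per weight
fp32_bytes = linear_weight_bytes(INPUT_DIM, LATENT_SIZE, 4)  # float32: 4 bytes per weight

print(f"INT8 encoder weights:    {int8_bytes / GIB:.2f} GiB")
print(f"float32 encoder weights: {fp32_bytes / GIB:.2f} GiB")
```

Under that assumption the encoder weights take about 1.9 GiB in INT8 versus about 7.8 GiB in float32, which is consistent with the 12GB RAM minimum stated for CPU inference.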
Feature Aggregation
The model uses pentagram (5-gram) aggregation for inference to improve coverage and robustness:
- Each trigram is mapped not only directly but also via associated pentagrams, which act as latent collocational groups.
- Example: The trigram "源俗説" (ID: 3621) may be expanded via quasi-collocational pentagrams such as ID 1946, which consists of "語源俗説" preceded by one more character.
Pentagram Expansion
How It Works
- Direct Trigram Matching
  - Input text is split into overlapping trigrams.
  - Each trigram is mapped to its dictionary ID if present ("white" trigrams get full weight).
- Pentagram (5-gram) Expansion
  - For each trigram, the model finds associated pentagrams (5-grams) that contain it.
  - The best-matching pentagram is selected based on the frequency of its constituent trigrams in the input.
  - All "white" trigrams from the selected pentagram are added to the feature vector with weighted scores.
Example: "源俗説"
| Component | ID | Score | Role |
|---|---|---|---|
| Direct Trigram | 3621 | 18.0 | "源俗説" (full weight) |
| Pentagram 1 | 1946 | - | one character followed by "語源俗説" |
| ├─ first trigram (that character + "語源") | 380 | 14.3 | (weighted) |
| ├─ 語源俗 | 1098 | 15.7 | (weighted) |
| └─ 源俗説 | 3621 | 18.0 | (redundant, already matched) |
| Pentagram 2 | 1947 | - | "語源俗説" followed by one character |
| ├─ last trigram ("俗説" + that character) | 845 | 12.1 | (weighted) |
| ├─ 語源俗 | 1098 | 15.7 | (weighted) |
| └─ 源俗説 | 3621 | 18.0 | (redundant) |
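The two steps described above (direct matching, then pentagram expansion) can be sketched on toy data. This is a minimal illustration and not the author's implementation: the dictionaries below are hypothetical stand-ins that reuse the IDs and scores from the example table, `GRAY_WEIGHT = 0.5` is an assumed value, and features are keyed by trigram ID rather than by column index.

```python
from collections import Counter

# Hypothetical toy dictionaries; IDs and scores follow the example table.
TEXT_TO_ID = {"語源俗": 1098, "源俗説": 3621}
STATUS = {380: True, 845: True, 1098: True, 3621: True}  # True = "white"
TRI_TO_NGRM = {1098: [1946, 1947], 3621: [1946, 1947]}
NGRM_TO_TRI = {1946: [380, 1098, 3621], 1947: [845, 1098, 3621]}
SCORES = {380: 14.3, 845: 12.1, 1098: 15.7, 3621: 18.0}
MAX_SCORE = 18.0
GRAY_WEIGHT = 0.5  # assumed value, for illustration only


def build_features(text):
    """Return {trigram_id: weight} for one input text."""
    # Step 1: split into overlapping trigrams and keep the known ones.
    ids = [TEXT_TO_ID[text[i:i + 3]]
           for i in range(len(text) - 2)
           if text[i:i + 3] in TEXT_TO_ID]
    counts = Counter(ids)
    # White trigrams seen directly in the text get full weight.
    features = {tid: 1.0 for tid in counts if STATUS.get(tid)}

    # Step 2: for each distinct trigram, pick the pentagram whose member
    # trigrams occur most often in the text, then add its unseen white members.
    for tid in dict.fromkeys(ids):
        candidates = TRI_TO_NGRM.get(tid, [])
        if not candidates:
            continue
        best = max(candidates,
                   key=lambda p: sum(counts[t] for t in NGRM_TO_TRI[p]))
        for member in NGRM_TO_TRI[best]:
            if STATUS.get(member) and member not in features:
                features[member] = min(1.0, SCORES[member] / MAX_SCORE) * GRAY_WEIGHT
    return features


features = build_features("語源俗説")
print(features)
```

For the input "語源俗説", both pentagrams tie on frequency, so `max` keeps the first candidate (1946). Trigram 380 is then added with a reduced weight, while 845 is left out.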
Weight Calculation
Trigram Weight Calculation
The semantic relevance score (score) for trigrams combines:
- IDF-based rarity (normalized to [0, 1]).
- Character composition weight (predefined per trigram, based on get_weight()):
def get_weight(char):
    """Return the character-type weight of a single character."""
    if '\u4e00' <= char <= '\u9faf':  # kanji
        return 6
    elif '\u3040' <= char <= '\u309F':  # hiragana
        return 2
    elif '\u30A0' <= char <= '\u30FF':  # katakana
        return 2
    elif 'a' <= char <= 'z' or 'A' <= char <= 'Z':  # Latin letters
        return 1
    elif '0' <= char <= '9':  # ASCII digits
        return 1
    else:  # punctuation, whitespace, anything else
        return 0
Key Definitions
weight (from get_weight): A precomputed score for each trigram, determined by its character types (kanji, hiragana, katakana, etc.). Example:
- `'日本語'` (kanji) → `6+6+6=18`.
- a hiragana-only trigram → `2+2+2=6`.
- `'abc'` (Latin) → `1+1+1=3`.
IDF (Inverse Document Frequency): Measures how rare a trigram is across all articles (higher = more distinctive). Normalized to [0, 1] via min-max scaling.
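To make the min-max scaling concrete, here is a minimal sketch. The model card does not say which IDF variant was used, so the textbook form `log(N / df)` is an assumption, as are the function name `normalized_idf` and the toy document frequencies.

```python
import math


def normalized_idf(doc_freq, n_docs):
    """Min-max scale IDF values to [0, 1].

    doc_freq maps each trigram to the number of documents containing it.
    Assumption: IDF = log(n_docs / df); the original variant is not documented.
    """
    if not doc_freq:
        return {}
    idf = {tri: math.log(n_docs / df) for tri, df in doc_freq.items()}
    lo, hi = min(idf.values()), max(idf.values())
    if hi == lo:  # every trigram is equally rare
        return {tri: 0.0 for tri in idf}
    return {tri: (value - lo) / (hi - lo) for tri, value in idf.items()}


# Toy document frequencies (made up for illustration).
toy = normalized_idf({"common": 5000, "medium": 50, "rare": 1}, n_docs=50_000)
print(toy)
```

The most common trigram maps to 0, the rarest to 1, and everything else falls in between.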
Score Formula
The combined score for each trigram is calculated as:
score = α * normalized_IDF + (weight * β)
Where:
| Parameter | Value | Purpose |
|---|---|---|
| α | 12 | Scales IDF importance (semantic distinctiveness). |
| β | 1/3 (~0.333) | Scales character-type weight (e.g., kanji trigrams get higher scores). |
| normalized_IDF | [0, 1] | (IDF − IDF_min) / (IDF_max − IDF_min). |
| weight | 3–18 | Sum of get_weight() for each character in the trigram. |
Example Calculation:
| Trigram | Characters | weight | IDF (Normalized) | score = 12*IDF + (weight * 1/3) |
|---|---|---|---|---|
| '日本語' | 6+6+6 (kanji) | 18 | 0.9 | 12*0.9 + (18*0.333) ≈ 10.8 + 6 = 16.8 |
| (hiragana trigram) | 2+2+2 (hiragana) | 6 | 0.2 | 12*0.2 + (6*0.333) ≈ 2.4 + 2 = 4.4 |
| 'abc' | 1+1+1 (Latin) | 3 | 0.5 | 12*0.5 + (3*0.333) ≈ 6 + 1 = 7 |
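The calculation in the table can be reproduced with a few lines of Python. This sketch reuses `get_weight()` and the constants α = 12 and β = 1/3 as documented. The helper names `trigram_weight` and `trigram_score` are introduced here for illustration, and "あいう" is just an arbitrary hiragana trigram chosen for the example.

```python
ALPHA = 12      # scales the normalized IDF
BETA = 1 / 3    # scales the character-type weight


def get_weight(char):
    """Character-type weight, as defined in the model card."""
    if '\u4e00' <= char <= '\u9faf':      # kanji
        return 6
    if '\u3040' <= char <= '\u309F':      # hiragana
        return 2
    if '\u30A0' <= char <= '\u30FF':      # katakana
        return 2
    if 'a' <= char <= 'z' or 'A' <= char <= 'Z':
        return 1
    if '0' <= char <= '9':
        return 1
    return 0


def trigram_weight(trigram):
    """Sum of the per-character weights of a trigram."""
    return sum(get_weight(c) for c in trigram)


def trigram_score(trigram, normalized_idf):
    """score = ALPHA * normalized_IDF + BETA * weight."""
    return ALPHA * normalized_idf + BETA * trigram_weight(trigram)


for trigram, idf in [("日本語", 0.9), ("あいう", 0.2), ("abc", 0.5)]:
    print(trigram, trigram_weight(trigram), round(trigram_score(trigram, idf), 2))
```

Because the normalized IDF is at most 1 and the weight is at most 18, the score tops out at 12 + 6 = 18. This matches the divisor used to normalize scores in the inference code.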
Filtering Trigrams
- Direct trigrams (status_dic[tid] == True): Automatically included with full weight (1.0) in the feature matrix. No score calculation needed; treated as high-confidence seeds.
- Aggregated (gray) trigrams: Included only if:
  - score < score_threshold (default: 18), and
  - they appear in a pentagram (a 5-character window, i.e. three overlapping trigrams) containing at least one direct (white) trigram.
(Example: A window like [white, gray, gray] or [gray, white, gray] or [gray, gray, white] → all trigrams in the window are included, even if gray trigrams individually fail the score ≥ 18 threshold.)
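One possible reading of the filtering rule is sketched below. It is an interpretation, not the author's code: the function name `select_from_window`, the toy IDs, and the exact handling of missing scores are all assumptions.

```python
SCORE_THRESHOLD = 18  # default threshold from the description above


def select_from_window(window, status, scores, threshold=SCORE_THRESHOLD):
    """Pick the trigram IDs to keep from one pentagram window.

    window: trigram IDs that make up the pentagram (normally three).
    status: {trigram_id: True for white, False for gray}.
    scores: {trigram_id: score}; a missing score is treated as 0.0 (assumption).
    """
    has_white = any(status.get(tid) is True for tid in window)
    selected = []
    for tid in window:
        if status.get(tid) is True:
            selected.append(tid)      # white: always kept
        elif has_white and scores.get(tid, 0.0) < threshold:
            selected.append(tid)      # gray: kept only next to a white trigram
    return selected


status = {1: True, 2: False, 3: False, 4: False}
scores = {1: 18.0, 2: 10.5, 3: 12.0, 4: 9.0}
print(select_from_window([1, 2, 3], status, scores))  # window with a white seed
print(select_from_window([2, 3, 4], status, scores))  # all-gray window
```

A window with a white seed keeps all three trigrams, while an all-gray window contributes nothing.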
Model Checkpoint Structure
The .pth file contains:
{
    "model": <PyTorch quantized model>,
    "metadata": {
        "trigram2col": { <trigram_id>: <column_index> },
        "idx2name": { <category_id>: <category_name> }
    },
    "agrs": {
        "status_dic": { <trigram_id>: <is_white> },
        "pairs_list": [ (<trigram_id>, <pentagram_id>), ... ],
        "ngrm_to_tri": { <pentagram_id>: [<trigram1>, <trigram2>, ...] },
        "score_dic": { <trigram_id>: <score> }
    },
    "config": {
        "input_dim": 2033473,
        "latent_size": 1024,
        "linker_size": 512,
        "num_categories": 11883
    }
}
Key AGR Dictionaries
| Dictionary | Description |
|---|---|
| status_dic | Flags trigrams as "white" (valid) or "gray" (aggregated via pentagrams). |
| pairs_list | Maps trigram IDs → pentagram IDs (e.g., (3621, 1946) means "源俗説" belongs to pentagram 1946, which contains "語源俗説"). |
| ngrm_to_tri | Maps pentagram IDs → constituent trigrams (e.g., {1946: [380, 1098, 3621]}). |
| score_dic | Corpus-based scores for trigrams (used for weighted aggregation). |
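The inference code looks up pentagrams by trigram ID through a `tri_to_ngrm` mapping that is not listed in the checkpoint. One plausible way to derive such a lookup from `pairs_list` is sketched below. The helper name `build_tri_to_ngrm` is hypothetical, and the toy pairs simply extend the (3621, 1946) example with the IDs from the pentagram table.

```python
from collections import defaultdict


def build_tri_to_ngrm(pairs_list):
    """Group (trigram_id, pentagram_id) pairs into {trigram_id: [pentagram_ids]}."""
    tri_to_ngrm = defaultdict(list)
    for trigram_id, pentagram_id in pairs_list:
        if pentagram_id not in tri_to_ngrm[trigram_id]:  # skip duplicate pairs
            tri_to_ngrm[trigram_id].append(pentagram_id)
    return dict(tri_to_ngrm)


# Toy pairs built from the example IDs (illustrative only).
pairs = [(3621, 1946), (3621, 1947), (1098, 1946),
         (1098, 1947), (380, 1946), (845, 1947)]
tri_to_ngrm = build_tri_to_ngrm(pairs)
print(tri_to_ngrm[3621])
```

Keeping the list order of `pairs_list` matters slightly, because a tie between candidate pentagrams is resolved in favor of whichever comes first.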
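Inference Example (PyTorch, CPU)
1. Load Model
import torch
# The checkpoint stores the whole pickled model object, so full unpickling is
# required (weights_only=False). Only load checkpoints from a source you trust.
checkpoint = torch.load("scyth_5_cpu_int8.pth", map_location="cpu", weights_only=False)
model = checkpoint["model"]
model.eval()
conf = checkpoint["config"]
meta = checkpoint["metadata"]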
2. Generate Embedding (Full Pipeline)
from collections import Counter

import torch

# MODEL_NAME, NEW_VECTOR_NAME, GRAY_WEIGHT, load_quantized_model() and
# load_dictionaries() come from the project's own code and are not shown here.

def generate_embedding(text, model_name=MODEL_NAME, output_embedding_file=NEW_VECTOR_NAME):
    if not text or not text.strip():
        return None
    text = text.lower()
    model, conf, meta, checkpoint = load_quantized_model(model_name)
    (
        status_dic, _, ngrm_to_tri, _,
        text_to_id_aux, _, score_dic, tri_to_ngrm,
    ) = load_dictionaries(checkpoint)
    trigram2col = meta["trigram2col"]
    input_vec = torch.zeros((1, conf["input_dim"]), dtype=torch.float32)
    total_trigrams_in_text = max(0, len(text) - 2)
    seen_white_tris = set()

    # TRIGRAMS: map each overlapping trigram to its dictionary ID (unknown ones are skipped)
    text_trigrams_ids = []
    for i in range(total_trigrams_in_text):
        tid = text_to_id_aux.get(text[i:i + 3])
        if tid is not None:
            text_trigrams_ids.append(tid)
    counts_in_text = Counter(text_trigrams_ids)

    # DIRECT: white trigrams found in the text get full weight (1.0)
    for tid in set(text_trigrams_ids):
        if status_dic.get(tid) is True and tid in trigram2col:
            seen_white_tris.add(tid)
            input_vec[0, trigram2col[tid]] = 1.0

    # AGR: expand each distinct trigram through its best-matching pentagram
    def get_ngrm_score(h_idx):
        # How often the pentagram's constituent trigrams occur in the input text
        return sum(counts_in_text.get(s_tid, 0) for s_tid in ngrm_to_tri.get(h_idx, []))

    for t_id in dict.fromkeys(text_trigrams_ids):  # unique IDs, in text order
        h_ids = tri_to_ngrm.get(t_id)
        if not h_ids:
            continue
        best_h_id = max(h_ids, key=get_ngrm_score)
        for s_tri_id in ngrm_to_tri.get(best_h_id, []):
            if (
                status_dic.get(s_tri_id) is True
                and s_tri_id not in seen_white_tris
                and s_tri_id in trigram2col
            ):
                seen_white_tris.add(s_tri_id)
                raw_score = score_dic.get(s_tri_id, 0.0)
                # 18.0 is the maximum possible score (12 * 1.0 + 18 / 3)
                normalized_score = max(0.0, min(1.0, raw_score / 18.0))
                input_vec[0, trigram2col[s_tri_id]] = normalized_score * GRAY_WEIGHT

    # MODEL: the third output of the forward pass is the linker embedding
    with torch.no_grad():
        _, _, linker_embedding = model(input_vec)
    return linker_embedding.squeeze(0).float().cpu().numpy()
Requirements
| Dependency | Version |
|---|---|
| Python | 3.11.9 |
| torch | 2.5.1 |
| torchaudio | 2.5.1 |
| torchvision | 0.20.1 |
| scipy | 1.13.1 |
| numpy | 1.26.4 |
| psutil | 6.0.0 |
| gradio | 5.23.0 |
| huggingface-hub | 0.36.0 |
| pandas | 2.2.3 |
| python-dateutil | 2.9.0.post0 |
| pillow | 10.4.0 |
| gradio (optional) | Latest |
Training Hardware
This model was fine-tuned using SCYTH Cyberia, a high-performance computing cluster with the following node specifications:
- CPU: 28 logical cores (14 physical cores with Hyper-Threading)
- Memory: 251 GB DDR4 RAM
- Storage: 915 GB SATA HDDs (RAID-configured for speed)
- Accelerators: 4 × NVIDIA Tesla K80 GPUs (11.4 GB GDDR5 VRAM each)
Why this setup?
- Massive parallel processing for large-scale language model training.
- High memory capacity to handle large batch sizes.
CPU Hardware
- Minimum: 12GB RAM (16GB+ recommended for batch processing).
- Quantization: Uses FBGEMM backend for optimized CPU inference.
Installation
pip install numpy==1.26.4 torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1 psutil==6.0.0 pandas==2.2.3 scipy==1.13.1
SCYTH-J: Japanese Text Classifier Demo
GitHub
What it does:
A demo app for classifying Japanese text into user-defined categories using a pre-trained quantized model (published on Hugging Face). Simply input text or upload files, and the app returns the most relevant topics with confidence scores.
How it works:
- Input Japanese text → The demo embeds it using the quantized scyth_5_cpu_int8 model.
- Compare against custom categories (pre-loaded or added by you).
- Rank matches by semantic similarity (cosine similarity).
Try it now! Add your own samples to expand the categorization.
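The ranking step of the demo can be illustrated with a small, dependency-free sketch. It assumes each category is represented by a single reference vector. The function names and the 2-dimensional toy vectors are made up for illustration; in the real pipeline the vectors would presumably be the embeddings returned by generate_embedding().

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(query_vec, category_vecs):
    """Return (category, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in category_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (illustrative only).
categories = {"history": [1.0, 0.0], "sports": [0.0, 1.0], "culture": [1.0, 1.0]}
for name, similarity in rank_categories([1.0, 0.0], categories):
    print(f"{name}: {similarity:.3f}")
```

Cosine similarity ignores vector length, so a short text and a long text on the same topic can still rank close together.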
License & Ethics
License
Usage Guidelines
- Research/Education: Free to use.
- Commercial Use: Requires explicit permission from the author.
- Ethical Use: Must comply with AI ethics standards.
- Reproducibility: Cite the model appropriately if used in publications.
For technical questions or integration support, contact: jhgudleik@gmail.com
References
- Dynamic Quantization: PyTorch Quantization Docs
- FBGEMM: Facebook Research's Gemm Library
tags: nlp, japanese, multilabel, pytorch, quantization, cpu, semantic-search, wikipedia, knowledge-graph, embedding
Last Updated: June 2026