🇹🇭 Thai NLP Toolkit

A multi-task NLP framework for the Thai language built from scratch with PyTorch.

The model uses a shared Transformer encoder backbone with three task-specific heads:

Task	Head	Metric
Named Entity Recognition	Token classification (7 labels)	Entity-level F1
Sentiment Analysis	Sentence classification (3 labels)	Macro-F1
Question Answering	Extractive span prediction	EM / F1
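
As a rough illustration of two of the metrics in the table, the sketch below computes macro-F1 (sentiment) and exact match (QA) in plain Python. The function names `macro_f1` and `exact_match` and the label strings are hypothetical and are not taken from the toolkit's code; the project's own evaluation may normalize text or handle edge cases differently.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def exact_match(gold_answer, pred_answer):
    """1.0 if the predicted span equals the gold span after trimming whitespace."""
    return float(gold_answer.strip() == pred_answer.strip())


# Illustrative labels only; the real label names may differ.
gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(round(macro_f1(gold, pred), 3))
print(exact_match("กรุงเทพมหานคร", " กรุงเทพมหานคร "))
```

Macro-F1 weights every class equally, so a rarely predicted class (here `neu`) pulls the score down as much as a frequent one would.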

Model Architecture

Tokenizer: SentencePiece BPE (32K vocab) with Thai-specific preprocessing
Encoder: 6-layer Transformer (d_model=256, 8 heads, d_ff=1024)
Max sequence length: 512 tokens
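
To give a sense of scale, the hyperparameters listed above can be turned into a rough parameter estimate. This is a back-of-the-envelope sketch, not a measurement of the released checkpoint: it assumes a standard encoder layer (biased Q/K/V/output projections, a two-layer feed-forward block, two LayerNorms), a vocabulary of exactly 32,000 entries, and it leaves out positional embeddings and the three task heads, none of which are specified in the card.

```python
# Hyperparameters stated in the model card.
VOCAB_SIZE = 32_000  # "32K vocab"; the exact size is an assumption
D_MODEL = 256
N_HEADS = 8
D_FF = 1024
N_LAYERS = 6

head_dim = D_MODEL // N_HEADS  # width of each attention head

# Assumed layer layout: standard encoder layer, all projections with bias.
attention = 4 * (D_MODEL * D_MODEL + D_MODEL)  # Q, K, V and output projections
feed_forward = (D_MODEL * D_FF + D_FF) + (D_FF * D_MODEL + D_MODEL)
layer_norms = 2 * (2 * D_MODEL)  # two LayerNorms, each with scale and shift
per_layer = attention + feed_forward + layer_norms

token_embeddings = VOCAB_SIZE * D_MODEL
encoder_total = token_embeddings + N_LAYERS * per_layer

print(f"head_dim         = {head_dim}")
print(f"per-layer params = {per_layer:,}")
print(f"token embeddings = {token_embeddings:,}")
print(f"encoder estimate = {encoder_total:,}")
```

Under these assumptions the token embedding table accounts for well over half of the total. If the positions are learned rather than sinusoidal, a 512 × 256 table would add another 131,072 parameters.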

Usage

# Clone the repository first
# git clone https://github.com/puttibenz/thai-nlp-toolkit.git

from inference.pipeline import ThaiNLPPipeline

pipeline = ThaiNLPPipeline(model_dir="path/to/downloaded/model", device="auto")

# NER
result = pipeline.predict("สมชายทำงานที่กรุงเทพ", task="ner")

# Sentiment Analysis
result = pipeline.predict("อาหารอร่อยมากครับ", task="sentiment")

# Question Answering
result = pipeline.predict(
    "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",
    task="qa",
    question="เมืองหลวงของประเทศไทยคืออะไร"
)

Training Data

Dataset	Task	Source
ThaiNER v2.2	NER	`pythainlp/thainer-corpus-v2.2`
Wisesight Sentiment	Sentiment	`pythainlp/wisesight_sentiment`
iApp Thai Wiki QA	QA	`iapp_wiki_qa_squad`

Training Details

Framework: PyTorch (custom implementation)
Training: Multi-task learning with round-robin sampling
Optimizer: AdamW with cosine LR schedule + warmup
Mixed Precision: FP16 on CUDA
Batch Size: 32 per step, with 4 gradient-accumulation steps (effective batch size 128)
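
The scheduling pieces listed above can be sketched in plain Python. This is one plausible reading, not the toolkit's actual training loop: `round_robin` and `lr_at` are hypothetical helper names, round-robin is interpreted as cycling through the tasks one batch at a time, and the peak learning rate, warmup length, and total step count are placeholder values chosen only for illustration.

```python
import math
from itertools import cycle, islice

TASKS = ["ner", "sentiment", "qa"]


def round_robin(tasks, num_steps):
    """Return one task name per step, cycling through the tasks in order."""
    return list(islice(cycle(tasks), num_steps))


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


BATCH_SIZE = 32
ACCUM_STEPS = 4
effective_batch = BATCH_SIZE * ACCUM_STEPS  # matches the stated effective size of 128

print(round_robin(TASKS, 7))
print(effective_batch)
# Placeholder schedule: 1000 steps, 100 warmup steps, peak LR 3e-4.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=3e-4))
```

With gradient accumulation, the optimizer and the schedule would typically advance once per 4 micro-batches, so `step` here counts optimizer updates rather than forward passes.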

File Structure

thai-nlp-toolkit/
├── checkpoint.pt              # Model weights
├── config.yaml                # Model architecture config
└── tokenizer/
    ├── thai_bpe.model         # SentencePiece BPE model
    └── tokenizer_config.json  # Tokenizer config
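
Before pointing `model_dir` at a downloaded copy, it can help to confirm that the layout above is complete. The helper below is a minimal sketch using only the standard library; `missing_files` is a hypothetical name and is not part of the toolkit.

```python
from pathlib import Path

# Relative paths taken from the file structure listed above.
EXPECTED_FILES = [
    "checkpoint.pt",
    "config.yaml",
    "tokenizer/thai_bpe.model",
    "tokenizer/tokenizer_config.json",
]


def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_files("path/to/downloaded/model")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Model directory looks complete.")
```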

Source Code

GitHub: https://github.com/puttibenz/thai-nlp-toolkit

License

MIT
