Bi-xLSTM[1:0]: ELMo-Style Indonesian Language Model
Pretrained on Indonesian Wikipedia · ELMo-style Bidirectional CLM · Pure mLSTM Architecture
Model Overview
This model implements a full ELMo-style bidirectional causal language model (CLM) using a Bi-xLSTM[1:0] backbone, meaning the architecture uses only mLSTM blocks (ratio 1:0, no sLSTM). It is pretrained on Indonesian Wikipedia (IndoWiki) for Indonesian NLP research.
| Property | Value |
|---|---|
| Architecture | Bi-xLSTM (Forward + Backward xLSTM stacks) |
| Block Type | mLSTM only (ratio 1:0) |
| Training Objective | ELMo-style (Forward CLM + Backward CLM) |
| Blocks per direction | 8 mLSTM blocks |
| Embedding dimension | 768 |
| Vocabulary size | 32,000 |
| Context length | 256 tokens |
| Attention heads | 4 |
| Dataset | Indonesian Wikipedia (wikimedia/wikipedia, 20231101.id) |
| Tokenizer | cahya/bert-base-indonesian-522M (Indonesian WordPiece, vocab 32k) |
| Training hardware | NVIDIA A100-SXM4-80GB (85.1 GB VRAM) |
| Epochs | 3 |
| Effective batch size | 32 (batch 16 × grad accum 2) |
| Learning rate | 3e-4 |
| Warmup steps | 500 |
| Framework | xlstm==2.0.5, PyTorch |
Training Objective: Why ELMo, Not BERT?
BERT uses masked language modeling (MLM): a single bidirectional encoder sees the full sequence at once. This model instead follows the ELMo approach: two independent causal language models, one per direction, trained separately:
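Forward CLM:
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Backward CLM:
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{t+1}, \dots, x_T)$$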
Each direction sees only one-directional context, so no information leaks between directions. The two directions are combined at the representation level, not the training level.
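As a minimal sketch of what one-directional context means in practice, the snippet below builds (context, target) pairs for both directions from a list of token IDs. The helper `clm_pairs` and the toy IDs are illustrative assumptions and are not taken from the pretraining notebook; a real pipeline would typically also add boundary tokens so the first and last positions have a context.

```python
from typing import List, Tuple

Pair = Tuple[List[int], int]


def clm_pairs(token_ids: List[int]) -> Tuple[List[Pair], List[Pair]]:
    """Return (forward_pairs, backward_pairs) of (context, target) tuples.

    Forward:  predict token t from the tokens strictly before it.
    Backward: predict token t from the tokens strictly after it.
    """
    n = len(token_ids)
    # Forward direction: left-to-right, context grows from the left.
    forward = [(token_ids[:t], token_ids[t]) for t in range(1, n)]
    # Backward direction: right-to-left, context grows from the right.
    backward = [(token_ids[t + 1:], token_ids[t]) for t in range(n - 2, -1, -1)]
    return forward, backward


ids = [11, 22, 33, 44]  # hypothetical token IDs
fwd, bwd = clm_pairs(ids)
```

In neither list does a target appear inside its own context, which is the "no leakage" property described above.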
Contextual Representations
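After pretraining, for token $t$ at layer $j$, the forward and backward hidden states are concatenated:
$$h_{t,j} = [\overrightarrow{h}_{t,j} \,;\, \overleftarrow{h}_{t,j}]$$
For downstream tasks, a weighted combination across layers can be used:
$$\mathrm{ELMo}_t = \gamma \sum_{j} s_j \, h_{t,j}$$
where $s_j$ are softmax-normalized scalar weights and $\gamma$ is a task-level scaling factor.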
Architecture Details
The Bi-xLSTM model consists of two independent xLSTM stacks:
                 Input Tokens
                      │
                      ▼
┌────────────────────────────────────────────┐
│         Token Embedding (768-dim)          │
└────────────────────────────────────────────┘
          │                       │
          ▼                       ▼
   ┌──────────────┐        ┌──────────────┐
   │   Forward    │        │   Backward   │
   │ xLSTM Stack  │        │ xLSTM Stack  │
   │  (8 mLSTM)   │        │  (8 mLSTM)   │
   │   → → → →    │        │   ← ← ← ←    │
   └──────────────┘        └──────────────┘
          │                       │
          ▼                       ▼
   ┌──────────────┐        ┌──────────────┐
   │   LM Head    │        │   LM Head    │
   │  (CLM loss)  │        │  (CLM loss)  │
   └──────────────┘        └──────────────┘
          │                       │
          └───── Concatenate ─────┘
                      │
                      ▼
          Contextual Representation
            h_t = [h_fwd ; h_bwd]
mLSTM Block Configuration
Each mLSTM block uses:
- num_heads = 4 (head dim = 192)
- proj_factor = 2.0 (feedforward expansion)
- conv1d_kernel = 4 (local context convolution)
- qkv_block_size = 4 (QKV projection factorization)
- dropout = 0.1
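The numbers in this configuration are tied together by simple arithmetic, which the sketch below recomputes. The `BlockSpec` dataclass and its property names are hypothetical, introduced only for illustration. Reading `proj_factor` as a multiplier on the embedding width is an assumption about how the xlstm library applies it, not something this card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSpec:
    """Illustrative container for the per-block settings listed above."""

    embed_dim: int = 768
    num_heads: int = 4
    proj_factor: float = 2.0

    @property
    def head_dim(self) -> int:
        # Per-head width as stated in this card: 768 / 4 = 192.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads

    @property
    def projected_dim(self) -> int:
        # Assumption: proj_factor scales the embedding width.
        return int(self.proj_factor * self.embed_dim)

    @property
    def bidirectional_dim(self) -> int:
        # Forward and backward states are concatenated: 768 + 768.
        return 2 * self.embed_dim


spec = BlockSpec()
print(spec.head_dim, spec.projected_dim, spec.bidirectional_dim)  # 192 1536 1536
```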
Hyperparameter Rationale
| Parameter | Value | Rationale |
|---|---|---|
| embed_dim | 768 | Standard BERT-base hidden size; suitable for Indonesian NLP |
| num_blocks | 8 per direction | Depth comparable to BERT-base (12 layers) |
| context_length | 256 | Memory-efficient; Indonesian Wikipedia sentences average ~40 tokens |
| num_heads | 4 | Standard for 768-dim (768/4 = 192 per head) |
| batch_size | 16 | Per-device; effective 32 with gradient accumulation |
| learning_rate | 3e-4 | Standard for xLSTM pretraining |
| warmup_steps | 500 | ~5–10% of total steps; stabilizes early training |
| grad_clip | 1.0 | Prevents gradient explosion in RNNs |
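A minimal sketch of how the warmup and batch settings in this table translate into numbers, assuming linear warmup to the peak learning rate followed by a constant rate. The shape of the schedule after warmup is not stated in this card, so the constant tail is a placeholder, and `lr_at_step` is a hypothetical helper.

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 500
BATCH_SIZE = 16   # per-device batch size
GRAD_ACCUM = 2    # gradient accumulation steps


def lr_at_step(step: int) -> float:
    """Learning rate for a 0-indexed optimizer step.

    Assumption: linear warmup over WARMUP_STEPS, then constant PEAK_LR.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


# One optimizer step consumes BATCH_SIZE * GRAD_ACCUM sequences.
effective_batch = BATCH_SIZE * GRAD_ACCUM
```

If 500 warmup steps are roughly 5–10% of training, as the table suggests, the full run would span on the order of 5,000 to 10,000 optimizer steps.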
Dataset: Indonesian Wikipedia
The model is pretrained on wikimedia/wikipedia (20231101.id), the November 2023 snapshot of Indonesian Wikipedia.
Why IndoWiki? Indonesian Wikipedia provides a large, clean, and diverse text corpus covering a wide range of formal Indonesian language usage, which makes it well suited to learning general-purpose contextual representations.
Preprocessing pipeline:
- Strip wiki-markup, HTML artifacts, and template syntax
- Drop documents shorter than 50 characters (stubs)
- Normalize whitespace, remove excessive newlines
- Preserve casing (tokenizer is case-sensitive)
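The steps above can be sketched with the standard library alone. The regular expressions below are simplified assumptions about what "wiki-markup" and "template syntax" cover, and the function names are hypothetical; the actual rules in the pretraining notebook may differ.

```python
import re
from typing import Iterable, Iterator

MIN_CHARS = 50  # documents shorter than this are treated as stubs

_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")               # {{template}}
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # [[target|label]] -> label
_HTML_TAG = re.compile(r"<[^>]+>")                      # <b>, </ref>, ...
_QUOTES = re.compile(r"'{2,}")                          # ''italic'' / '''bold'''
_SPACES = re.compile(r"[ \t]+")
_NEWLINES = re.compile(r"\n{3,}")


def clean_document(text: str) -> str:
    """Strip simple markup and normalize whitespace; casing is left untouched."""
    previous = None
    while previous != text:  # repeat so nested templates are removed too
        previous = text
        text = _TEMPLATE.sub("", text)
    text = _LINK.sub(r"\1", text)
    text = _HTML_TAG.sub("", text)
    text = _QUOTES.sub("", text)
    text = _SPACES.sub(" ", text)
    text = _NEWLINES.sub("\n\n", text)
    return text.strip()


def preprocess(docs: Iterable[str]) -> Iterator[str]:
    """Yield cleaned documents, skipping stubs."""
    for doc in docs:
        cleaned = clean_document(doc)
        if len(cleaned) >= MIN_CHARS:
            yield cleaned


raw = (
    "'''Jakarta''' adalah [[ibu kota|ibukota]] {{fact}} <b>Indonesia</b>"
    "   dan kota terbesar di negara ini.\n\n\n\nTeks lain."
)
kept = list(preprocess(["Stub.", raw]))
```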
Usage
Installation
pip install xlstm==2.0.5 transformers torch
Load Tokenizer
This model uses the cahya/bert-base-indonesian-522M tokenizer, an Indonesian WordPiece tokenizer with a vocabulary of 32,000 tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cahya/bert-base-indonesian-522M")
Extract Contextual Representations
import torch

# Tokenize input (truncated to the 256-token context length)
text = "Bahasa Indonesia adalah bahasa resmi Republik Indonesia."
tokens = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

# Load model (once weights are uploaded)
# model = BiXLSTM.from_pretrained("Fakhri2503/xLSTM")
# model.eval()
# with torch.no_grad():
#     fwd_repr, bwd_repr = model.encode(tokens["input_ids"])
#     # Concatenate for the full bidirectional representation.
#     # Named bidir_repr to avoid shadowing the built-in repr().
#     bidir_repr = torch.cat([fwd_repr, bwd_repr], dim=-1)  # last dimension: 768 + 768 = 1536
Downstream Task (ELMo-style)
# For a downstream task, use task-specific scalar mix:
# elmo_repr = gamma * sum(s_j * h_j for each layer j)
# Scalar weights s_j are learned during fine-tuning
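To make the scalar mix concrete, here is a framework-free sketch of the forward computation for a single token: softmax over raw per-layer weights, a weighted sum of the layer vectors, then scaling by gamma. The function names and toy values are illustrative assumptions. In an actual fine-tuning setup the raw weights and gamma would be learnable parameters (for example in PyTorch) rather than fixed numbers.

```python
import math
from typing import List, Sequence


def softmax(raw: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scalars."""
    peak = max(raw)
    exps = [math.exp(w - peak) for w in raw]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(
    layer_vectors: Sequence[Sequence[float]],
    raw_weights: Sequence[float],
    gamma: float = 1.0,
) -> List[float]:
    """Compute gamma * sum_j s_j * h_j for one token, with s = softmax(raw_weights)."""
    if len(layer_vectors) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * h_j[d] for s_j, h_j in zip(s, layer_vectors))
        for d in range(dim)
    ]


# Three toy layers of width 2; equal raw weights give s_j = 1/3 each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
```

With equal weights the result is simply gamma times the average of the layers; a strongly dominant weight makes the mix approach a single layer's representation.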
Research Context
This model is part of a research project exploring xLSTM-based language models for Indonesian NLP benchmarks, particularly:
- IndoNLU tasks (text classification, NER, sentiment analysis)
- ABSA (Aspect-Based Sentiment Analysis) on Indonesian product reviews
The ELMo-style training allows the model to serve as a feature extractor, producing contextual embeddings that can be used by downstream task-specific models.
Citation
If you use this model, please cite the original xLSTM paper:
@article{beck2024xlstm,
title={xLSTM: Extended Long Short-Term Memory},
author={Beck, Maximilian and Pöppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, Günter and Brandstetter, Johannes and Hochreiter, Sepp},
journal={arXiv preprint arXiv:2405.04517},
year={2024}
}
And the original ELMo paper:
@inproceedings{peters2018deep,
title={Deep contextualized word representations},
author={Peters, Matthew E and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
booktitle={Proceedings of NAACL-HLT 2018},
year={2018}
}
Author
Fakhri · NLP Research
Pretraining code: BiXLSTM_Pretraining_Fakhri.ipynb