🧠 Bi-xLSTM[1:0] — ELMo-Style Indonesian Language Model

Pretrained on Indonesian Wikipedia · ELMo-style Bidirectional CLM · Pure mLSTM Architecture

📋 Model Overview

This model implements a full ELMo-style bidirectional causal language model (CLM) using a Bi-xLSTM[1:0] backbone — meaning the architecture uses pure mLSTM blocks (ratio 1:0, no sLSTM). It is pretrained on Indonesian Wikipedia (IndoWiki) for Indonesian NLP research.

| Property | Value |
|---|---|
| Architecture | Bi-xLSTM (Forward + Backward xLSTM stacks) |
| Block Type | mLSTM only (ratio 1:0) |
| Training Objective | ELMo-style (Forward CLM + Backward CLM) |
| Blocks per direction | 8 mLSTM blocks |
| Embedding dimension | 768 |
| Vocabulary size | 32,000 |
| Context length | 256 tokens |
| Attention heads | 4 |
| Dataset | Indonesian Wikipedia (`wikimedia/wikipedia`, `20231101.id`) |
| Tokenizer | `cahya/bert-base-indonesian-522M` (Indonesian WordPiece, vocab 32k) |
| Training hardware | NVIDIA A100-SXM4-80GB (85.1 GB VRAM) |
| Epochs | 3 |
| Effective batch size | 32 (batch 16 × grad accum 2) |
| Learning rate | 3e-4 |
| Warmup steps | 500 |
| Framework | `xlstm==2.0.5`, PyTorch |

🎯 Training Objective: Why ELMo, Not BERT?

BERT uses masked language modeling (MLM): a single bidirectional encoder sees the full sequence at once. This model instead follows the ELMo approach, in which two causal language models, one per direction, are trained with separate objectives:

Forward CLM: $P(t_1, t_2, \ldots, t_T) = \prod_{k=1}^{T} P(t_k \mid t_1, \ldots, t_{k-1})$

Backward CLM: $P(t_1, t_2, \ldots, t_T) = \prod_{k=1}^{T} P(t_k \mid t_{k+1}, \ldots, t_T)$

Each direction sees only one-directional context — no information leakage. The two directions are combined at the representation level, not the training level.
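
The two factorizations above can be made concrete by listing, for every position, which tokens each direction is allowed to see. The following is a minimal sketch in plain Python; the function names are hypothetical and are not taken from the pretraining notebook.

```python
from typing import List, Sequence, Tuple

# (visible context, token to predict)
Pair = Tuple[Tuple[int, ...], int]


def forward_pairs(ids: Sequence[int]) -> List[Pair]:
    """Forward CLM: predict t_k from t_1 .. t_{k-1}."""
    return [(tuple(ids[:k]), ids[k]) for k in range(len(ids))]


def backward_pairs(ids: Sequence[int]) -> List[Pair]:
    """Backward CLM: predict t_k from t_{k+1} .. t_T."""
    return [(tuple(ids[k + 1:]), ids[k]) for k in range(len(ids))]


ids = [11, 22, 33]  # toy token ids
print(forward_pairs(ids))   # [((), 11), ((11,), 22), ((11, 22), 33)]
print(backward_pairs(ids))  # [((22, 33), 11), ((33,), 22), ((), 33)]
```

No context ever contains the token being predicted, which is the "no information leakage" property. In a real training loop the empty contexts are usually handled with begin- and end-of-sequence tokens; that detail is an assumption here, since the card does not describe it.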

Contextual Representations

After pretraining, the forward and backward hidden states for the token at position $t$ in layer $j$ are concatenated:

$h_{t,j} = [\overrightarrow{h}_{t,j}\, ;\, \overleftarrow{h}_{t,j}]$

For downstream tasks, a weighted combination across layers can be used:

$\text{ELMo}_t = \gamma \sum_j s_j \cdot h_{t,j}$

where $s_j$ are softmax-normalized scalar weights and $\gamma$ is a task-level scaling factor.
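
As an illustration of the weighted combination, the sketch below computes it for one token in plain Python. The helper names and the toy sizes are assumptions made for readability; a real implementation would operate on PyTorch tensors with 768 dimensions per direction and would learn the raw weights and $\gamma$ during fine-tuning.

```python
import math
from typing import List


def softmax(raw: List[float]) -> List[float]:
    """Turn raw per-layer weights into s_j values that sum to 1."""
    m = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def concat_directions(h_fwd: List[float], h_bwd: List[float]) -> List[float]:
    """h_{t,j} = [forward state ; backward state] for one token and one layer."""
    return h_fwd + h_bwd


def scalar_mix(layers: List[List[float]], raw_weights: List[float], gamma: float) -> List[float]:
    """ELMo_t = gamma * sum_j s_j * h_{t,j} for one token."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layers)) for d in range(dim)]


# Toy example: 2 layers, 2 dimensions per direction
h_layers = [
    concat_directions([1.0, 2.0], [3.0, 4.0]),
    concat_directions([3.0, 4.0], [5.0, 6.0]),
]
mixed = scalar_mix(h_layers, raw_weights=[0.0, 0.0], gamma=2.0)
print(mixed)  # equal raw weights give the layer average scaled by gamma
```

Which layers enter the sum (for example, whether the token embedding counts as a layer) is not stated in this card and would need to be fixed by the downstream setup.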

🏗️ Architecture Details

The Bi-xLSTM model consists of two independent xLSTM stacks:

Input Tokens
     │
     ▼
┌──────────────────────────────────────────────┐
│            Token Embedding (768-dim)          │
└──────────────────────────────────────────────┘
     │                          │
     ▼                          ▼
┌──────────────┐        ┌──────────────┐
│  Forward     │        │  Backward    │
│  xLSTM Stack │        │  xLSTM Stack │
│  (8 mLSTM)  │        │  (8 mLSTM)  │
│  → → → →   │        │  ← ← ← ←   │
└──────────────┘        └──────────────┘
     │                          │
     ▼                          ▼
┌──────────────┐        ┌──────────────┐
│  LM Head     │        │  LM Head     │
│  (CLM loss) │        │  (CLM loss) │
└──────────────┘        └──────────────┘
     │                          │
     └──────── Concatenate ─────┘
                    │
                    ▼
          Contextual Representation
          h_t = [h_fwd ; h_bwd]
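
One common way to realize the backward branch of a diagram like this is to reverse the sequence, run a causal stack over it, and reverse the outputs again so they line up with the original positions. The sketch below demonstrates that alignment with a running sum standing in for a causal stack. It is an assumption about implementation, not a description of the pretraining code, and in the real model the backward stack has its own weights.

```python
from typing import Callable, List, Tuple


def causal_prefix_sum(xs: List[int]) -> List[int]:
    """Stand-in for a causal stack: output at position t depends only on xs[: t + 1]."""
    out, running = [], 0
    for x in xs:
        running += x
        out.append(running)
    return out


def bidirectional_states(
    xs: List[int], causal_fn: Callable[[List[int]], List[int]]
) -> List[Tuple[int, int]]:
    """Pair a left-to-right state with a right-to-left state at every position."""
    fwd = causal_fn(xs)
    # Reverse, run the same kind of causal function, then reverse back to re-align.
    bwd = causal_fn(xs[::-1])[::-1]
    return list(zip(fwd, bwd))


states = bidirectional_states([1, 2, 3, 4], causal_prefix_sum)
print(states)  # [(1, 10), (3, 9), (6, 7), (10, 4)]
```

Each pair combines a summary of everything up to position t with a summary of everything from position t onward. With padded batches the reversal has to respect each sequence's true length, which this toy version ignores.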

mLSTM Block Configuration

Each mLSTM block uses:

num_heads = 4 (head dim = 192)
proj_factor = 2.0 (up-projection factor: inner width 2 × 768 = 1536)
conv1d_kernel = 4 (local context convolution)
qkv_block_size = 4 (QKV projection factorization)
dropout = 0.1
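
The settings above imply a few derived sizes, collected in the sketch below. `MLSTMBlockSettings` is a hypothetical container written for this illustration; it is not a class from the `xlstm` package, and the field names simply mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLSTMBlockSettings:
    """Hypothetical holder for the per-block settings listed in this card."""

    embed_dim: int = 768
    num_heads: int = 4
    proj_factor: float = 2.0
    conv1d_kernel: int = 4
    qkv_block_size: int = 4
    dropout: float = 0.1

    def __post_init__(self) -> None:
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Per-head width measured on the embedding dimension (768 / 4)."""
        return self.embed_dim // self.num_heads

    @property
    def inner_dim(self) -> int:
        """Width after the up-projection (proj_factor * embed_dim)."""
        return int(self.proj_factor * self.embed_dim)


cfg = MLSTMBlockSettings()
print(cfg.head_dim)   # 192
print(cfg.inner_dim)  # 1536
```

If the implementation splits heads over the up-projected width rather than the embedding width, the per-head size would be `cfg.inner_dim // cfg.num_heads` (384) instead of 192. Which of the two applies depends on the library internals and is not stated here.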

📊 Hyperparameter Rationale

| Parameter | Value | Rationale |
|---|---|---|
| `embed_dim` | 768 | Standard BERT-base hidden size; suitable for Indonesian NLP |
| `num_blocks` | 8/direction | Depth comparable to BERT-base (12 layers) |
| `context_length` | 256 | Memory-efficient; Indonesian Wikipedia sentences avg ~40 tokens |
| `num_heads` | 4 | Standard for 768-dim (768/4 = 192 per head) |
| `batch_size` | 16 | Per-device; effective 32 with grad accumulation |
| `learning_rate` | 3e-4 | Standard for xLSTM pretraining |
| `warmup_steps` | 500 | ~5–10% of total steps; stabilizes early training |
| `grad_clip` | 1.0 | Prevents gradient explosion in RNNs |
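
The batch and warmup values can be checked with a few lines of arithmetic. The sketch below assumes a linear warmup shape and a single device; the card only gives the number of warmup steps, and it does not say what happens to the learning rate afterwards, so the schedule is held constant here.

```python
def lr_at(step: int, base_lr: float = 3e-4, warmup_steps: int = 500) -> float:
    """Linear warmup from 0 to base_lr, then constant.

    The linear shape is an assumption; any decay after warmup is left out
    because the card does not specify one.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


per_device_batch = 16
grad_accum_steps = 2
context_length = 256

effective_batch = per_device_batch * grad_accum_steps     # sequences per optimizer step
max_tokens_per_update = effective_batch * context_length  # upper bound, ignoring padding

print(effective_batch)        # 32
print(max_tokens_per_update)  # 8192
print(lr_at(250))             # halfway through warmup, about 1.5e-4
```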

📚 Dataset: Indonesian Wikipedia

The model is pretrained on wikimedia/wikipedia (20231101.id) — the November 2023 snapshot of Indonesian Wikipedia.

Why IndoWiki? Indonesian Wikipedia provides a large, clean, and diverse text corpus covering a wide range of formal Indonesian language usage — ideal for learning general-purpose contextual representations.

Preprocessing pipeline:

Strip wiki-markup, HTML artifacts, and template syntax
Filter documents with fewer than 50 characters (stubs)
Normalize whitespace, remove excessive newlines
Preserve casing (tokenizer is case-sensitive)
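
The pipeline steps above could look roughly like the following. This is a sketch using simple regular expressions chosen for illustration; the actual patterns used for pretraining are not given in this card, and the function names are hypothetical.

```python
import re
from typing import Iterable, List

MIN_CHARS = 50  # documents shorter than this are treated as stubs


def clean_article(text: str) -> str:
    """Strip common wiki and HTML residue and normalize whitespace; casing is untouched."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # {{templates}} (non-nested only)
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"<[^>]+>", "", text)                            # HTML tags
    text = re.sub(r"'{2,}", "", text)                              # ''italic'' and '''bold''' quotes
    text = re.sub(r"[ \t]+", " ", text)                            # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                           # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                         # keep at most one blank line
    return text.strip()


def preprocess(docs: Iterable[str]) -> List[str]:
    """Clean every document, then drop those below the length threshold."""
    cleaned = (clean_article(d) for d in docs)
    return [d for d in cleaned if len(d) >= MIN_CHARS]


raw = "'''Jakarta''' adalah [[ibu kota|ibukota]] {{fakta}} <b>Indonesia</b>.\n\n\n\nKota   ini   besar."
print(clean_article(raw))
```

Nested templates and tables would need a proper wikitext parser; the regular expressions here only cover the simple cases.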

🚀 Usage

Installation

pip install xlstm==2.0.5 transformers torch

Load Tokenizer

This model uses the cahya/bert-base-indonesian-522M tokenizer — an Indonesian WordPiece tokenizer with a vocabulary of 32,000 tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cahya/bert-base-indonesian-522M")

Extract Contextual Representations

import torch

# Tokenize input
text = "Bahasa Indonesia adalah bahasa resmi Republik Indonesia."
tokens = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

# Load model (once weights are uploaded)
# model = BiXLSTM.from_pretrained("Fakhri2503/xLSTM")
# model.eval()

# with torch.no_grad():
#     fwd_repr, bwd_repr = model.encode(tokens["input_ids"])
#     # Concatenate for the full bidirectional representation
#     bi_repr = torch.cat([fwd_repr, bwd_repr], dim=-1)  # last dim: 768 + 768 = 1536

Downstream Task (ELMo-style)

# For a downstream task, use task-specific scalar mix:
# elmo_repr = gamma * sum(s_j * h_j for each layer j)
# Scalar weights s_j are learned during fine-tuning

🔬 Research Context

This model is part of a research project exploring xLSTM-based language models for Indonesian NLP benchmarks, particularly:

IndoNLU tasks (text classification, NER, sentiment analysis)
ABSA (Aspect-Based Sentiment Analysis) on Indonesian product reviews

The ELMo-style training allows the model to serve as a feature extractor, producing contextual embeddings that can be used by downstream task-specific models.

📖 Citation

If you use this model, please cite the original xLSTM paper:

@article{beck2024xlstm,
  title={xLSTM: Extended Long Short-Term Memory},
  author={Beck, Maximilian and Pöppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, Günter and Brandstetter, Johannes and Hochreiter, Sepp},
  journal={arXiv preprint arXiv:2405.04517},
  year={2024}
}

And the original ELMo paper:

@inproceedings{peters2018deep,
  title={Deep contextualized word representations},
  author={Peters, Matthew E and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  booktitle={Proceedings of NAACL-HLT 2018},
  year={2018}
}

👤 Author

Fakhri — NLP Research
Pretraining code: BiXLSTM_Pretraining_Fakhri.ipynb

Pretrained on Indonesian Wikipedia · Bidirectional xLSTM · ELMo-style objective
