🧠 Bi-xLSTM[1:0] β€” ELMo-Style Indonesian Language Model

Pretrained on Indonesian Wikipedia · ELMo-style Bidirectional CLM · Pure mLSTM Architecture



📋 Model Overview

This model implements a full ELMo-style bidirectional causal language model (CLM) using a Bi-xLSTM[1:0] backbone, meaning the architecture uses pure mLSTM blocks (ratio 1:0, no sLSTM). It is pretrained on Indonesian Wikipedia (IndoWiki) for Indonesian NLP research.

| Property | Value |
|---|---|
| Architecture | Bi-xLSTM (Forward + Backward xLSTM stacks) |
| Block Type | mLSTM only (ratio 1:0) |
| Training Objective | ELMo-style (Forward CLM + Backward CLM) |
| Blocks per direction | 8 mLSTM blocks |
| Embedding dimension | 768 |
| Vocabulary size | 32,000 |
| Context length | 256 tokens |
| Attention heads | 4 |
| Dataset | Indonesian Wikipedia (wikimedia/wikipedia, 20231101.id) |
| Tokenizer | cahya/bert-base-indonesian-522M (Indonesian WordPiece, vocab 32k) |
| Training hardware | NVIDIA A100-SXM4-80GB (85.1 GB VRAM) |
| Epochs | 3 |
| Effective batch size | 32 (batch 16 × grad accum 2) |
| Learning rate | 3e-4 |
| Warmup steps | 500 |
| Framework | xlstm==2.0.5, PyTorch |

🎯 Training Objective: Why ELMo, Not BERT?

BERT uses masked language modeling (MLM): a single bidirectional encoder sees the full sequence at once. This model instead follows the ELMo approach, in which two causally independent generative models, one per direction, are trained separately:

Forward CLM: $$P(t_1, t_2, \ldots, t_T) = \prod_{k=1}^{T} P(t_k \mid t_1, \ldots, t_{k-1})$$

Backward CLM: $$P(t_1, t_2, \ldots, t_T) = \prod_{k=1}^{T} P(t_k \mid t_{k+1}, \ldots, t_T)$$

Each direction sees only one-directional context, so no information leaks in from the other direction. The two directions are combined at the representation level, not the training level.
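The two factorizations can be turned into next-token training pairs as sketched below. This is a minimal illustration, not the author's pretraining code: the helper names `clm_pairs` and `forward_backward_pairs` are hypothetical, plain Python lists stand in for tensors, and the boundary tokens needed for the $k=1$ term are left out.

```python
def clm_pairs(token_ids):
    """Return (inputs, targets) for next-token prediction on one sequence."""
    return token_ids[:-1], token_ids[1:]


def forward_backward_pairs(token_ids):
    """Build training pairs for both directions.

    The backward pairs come from the reversed sequence, so predicting the
    "next" token there means predicting t_k from t_{k+1}, ..., t_T.
    """
    forward = clm_pairs(token_ids)
    backward = clm_pairs(token_ids[::-1])
    return forward, backward


tokens = [11, 22, 33, 44]  # made-up token ids
(fwd_in, fwd_tgt), (bwd_in, bwd_tgt) = forward_backward_pairs(tokens)
print(fwd_in, fwd_tgt)  # [11, 22, 33] [22, 33, 44]
print(bwd_in, bwd_tgt)  # [44, 33, 22] [33, 22, 11]
```

Each direction gets its own loss over its own pairs, which is why neither model ever conditions on the other side of the token it is predicting.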

Contextual Representations

After pretraining, for token $t$ at layer $j$, the representations are concatenated:

$$h_{t,j} = [\overrightarrow{h}_{t,j}\, ;\, \overleftarrow{h}_{t,j}]$$

For downstream tasks, a weighted combination across layers can be used:

$$\text{ELMo}_t = \gamma \sum_j s_j \cdot h_{t,j}$$

where $s_j$ are softmax-normalized scalar weights and $\gamma$ is a task-level scaling factor.
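The concatenation and the layer-weighted sum can be worked through on toy numbers. The sketch below uses plain lists for one token's hidden vectors. The function names and the two-dimensional vectors are assumptions for illustration, whereas in the model each direction is 768-dimensional.

```python
import math


def softmax(weights):
    """Softmax-normalize raw scalar weights so they sum to 1."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(fwd_layers, bwd_layers, raw_weights, gamma=1.0):
    """Compute gamma * sum_j s_j * h_{t,j} for a single token.

    fwd_layers[j] and bwd_layers[j] are the forward and backward hidden
    vectors of that token at layer j.
    """
    s = softmax(raw_weights)
    # h_{t,j} = [forward ; backward], concatenated per layer
    layers = [f + b for f, b in zip(fwd_layers, bwd_layers)]
    dim = len(layers[0])
    return [gamma * sum(s_j * h[i] for s_j, h in zip(s, layers)) for i in range(dim)]


fwd = [[1.0, 0.0], [0.0, 2.0]]  # two layers, toy 2-dim forward states
bwd = [[0.5, 0.5], [1.5, 1.5]]  # two layers, toy 2-dim backward states
mixed = scalar_mix(fwd, bwd, raw_weights=[0.0, 0.0], gamma=2.0)
print(mixed)  # [1.0, 2.0, 2.0, 2.0]
```

With equal raw weights the softmax gives each layer a weight of 0.5, so the result is the layer average scaled by `gamma`. During fine-tuning the raw weights and `gamma` would be learned.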


🏗️ Architecture Details

The Bi-xLSTM model consists of two independent xLSTM stacks:

Input Tokens
     │
     ▼
┌──────────────────────────────────────────────┐
│            Token Embedding (768-dim)         │
└──────────────────────────────────────────────┘
     │                          │
     ▼                          ▼
┌──────────────┐        ┌──────────────┐
│  Forward     │        │  Backward    │
│  xLSTM Stack │        │  xLSTM Stack │
│  (8 mLSTM)   │        │  (8 mLSTM)   │
│  → → → →     │        │  ← ← ← ←     │
└──────────────┘        └──────────────┘
     │                          │
     ▼                          ▼
┌──────────────┐        ┌──────────────┐
│  LM Head     │        │  LM Head     │
│  (CLM loss)  │        │  (CLM loss)  │
└──────────────┘        └──────────────┘
     │                          │
     └──────── Concatenate ─────┘
                    │
                    ▼
          Contextual Representation
          h_t = [h_fwd ; h_bwd]
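One common way to realise a backward stack with a causal model is to feed it the reversed sequence and flip its outputs back, so that index t lines up in both directions. Whether this model does exactly that is not stated here, so the sketch below is an assumption. It replaces the xLSTM stack with a toy causal function (hypothetical name `visible_context`) that reports which tokens each position can depend on.

```python
def visible_context(tokens):
    """Toy causal 'stack': position i may depend only on tokens[0..i]."""
    return [tuple(tokens[: i + 1]) for i in range(len(tokens))]


def bidirectional_context(tokens):
    """Pair each position's forward context with its re-aligned backward context."""
    forward = visible_context(tokens)
    # Backward direction: run the same causal function on the reversed
    # sequence, then flip the outputs so index t refers to the same token.
    backward = visible_context(tokens[::-1])[::-1]
    return list(zip(forward, backward))


ctx = bidirectional_context(["a", "b", "c"])
for position, (fwd_seen, bwd_seen) in enumerate(ctx):
    print(position, fwd_seen, bwd_seen)
# 0 ('a',) ('c', 'b', 'a')
# 1 ('a', 'b') ('c', 'b')
# 2 ('a', 'b', 'c') ('c',)
```

After re-alignment, the forward state at position t covers the tokens up to t and the backward state covers the tokens from t onward. Concatenating them gives every position a view of the whole sequence.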

mLSTM Block Configuration

Each mLSTM block uses:

  • num_heads = 4 (head dim = 192)
  • proj_factor = 2.0 (feedforward expansion)
  • conv1d_kernel = 4 (local context convolution)
  • qkv_block_size = 4 (QKV projection factorization)
  • dropout = 0.1

📊 Hyperparameter Rationale

| Parameter | Value | Rationale |
|---|---|---|
| embed_dim | 768 | Standard BERT-base hidden size; suitable for Indonesian NLP |
| num_blocks | 8/direction | Depth comparable to BERT-base (12 layers) |
| context_length | 256 | Memory-efficient; Indonesian Wikipedia sentences avg ~40 tokens |
| num_heads | 4 | Standard for 768-dim (768/4 = 192 per head) |
| batch_size | 16 | Per-device; effective 32 with grad accumulation |
| learning_rate | 3e-4 | Standard for xLSTM pretraining |
| warmup_steps | 500 | ~5–10% of total steps; stabilizes early training |
| grad_clip | 1.0 | Prevents gradient explosion in RNNs |
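The arithmetic behind three of these settings can be sketched in plain Python. The function names are hypothetical, and the linear warmup shape and the constant rate after warmup are assumptions, since the schedule beyond the warmup length is not specified here.

```python
import math

PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2
PEAK_LR = 3e-4
WARMUP_STEPS = 500
GRAD_CLIP = 1.0


def effective_batch_size(per_device=PER_DEVICE_BATCH, accum=GRAD_ACCUM_STEPS):
    """Sequences contributing to one optimizer step."""
    return per_device * accum


def warmup_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then constant (assumed schedule)."""
    return peak_lr * min(1.0, step / warmup_steps)


def clip_by_global_norm(grads, max_norm=GRAD_CLIP):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]


print(effective_batch_size())  # 32
print(warmup_lr(250))          # halfway through warmup: half of the peak rate
print(warmup_lr(5000))         # after warmup: the peak rate
print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 rescaled to norm 1.0
```

In a PyTorch training loop the clipping step would normally be `torch.nn.utils.clip_grad_norm_` rather than a hand-written helper. The list version above only makes the rescaling explicit.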

📚 Dataset: Indonesian Wikipedia

The model is pretrained on wikimedia/wikipedia (20231101.id), the November 2023 snapshot of Indonesian Wikipedia.

Why IndoWiki? Indonesian Wikipedia provides a large, clean, and diverse text corpus covering a wide range of formal Indonesian usage, which makes it well suited to learning general-purpose contextual representations.

Preprocessing pipeline:

  1. Strip wiki-markup, HTML artifacts, and template syntax
  2. Filter documents with fewer than 50 characters (stubs)
  3. Normalize whitespace, remove excessive newlines
  4. Preserve casing (tokenizer is case-sensitive)
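The four steps above could be sketched with the standard `re` module as follows. This is an illustrative approximation, not the author's pipeline: the regular expressions are assumptions that handle only simple, non-nested markup, and `clean_article` is a hypothetical name.

```python
import re

MIN_CHARS = 50  # documents shorter than this are treated as stubs


def clean_article(text):
    """Return the cleaned text, or None if the document is a stub."""
    # 1. Strip template syntax, HTML tags, and wiki markup
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # {{template}} (non-nested)
    text = re.sub(r"<[^>]+>", "", text)         # <tag> and </tag>
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)           # ''italic'' and '''bold''' quotes
    # 3. Normalize whitespace and remove excessive newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # 2. Filter stubs (checked after cleaning so markup does not count)
    # 4. Casing is preserved: no lowercasing anywhere in this function
    return text if len(text) >= MIN_CHARS else None


raw = (
    "'''Jakarta''' adalah [[ibu kota|ibu kota]] {{fact}} <b>Indonesia</b>."
    "\n\n\n\nKota ini   terletak di [[Pulau Jawa]]."
)
cleaned = clean_article(raw)
print(cleaned)
print(clean_article("Rintisan."))  # None
```

The stub check runs after cleaning here, which is a design choice of this sketch. Filtering on raw length instead would let markup-heavy stubs through.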

🚀 Usage

Installation

pip install xlstm==2.0.5 transformers torch

Load Tokenizer

This model uses the cahya/bert-base-indonesian-522M tokenizer, an Indonesian WordPiece tokenizer with a vocabulary of 32,000 tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cahya/bert-base-indonesian-522M")

Extract Contextual Representations

import torch

# Tokenize input
text = "Bahasa Indonesia adalah bahasa resmi Republik Indonesia."
tokens = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

# Load model (once weights are uploaded)
# model = BiXLSTM.from_pretrained("Fakhri2503/xLSTM")
# model.eval()

# with torch.no_grad():
#     fwd_repr, bwd_repr = model.encode(tokens["input_ids"])
#     # Concatenate for full bidirectional representation
#     # (named bidir_repr to avoid shadowing the built-in repr)
#     bidir_repr = torch.cat([fwd_repr, bwd_repr], dim=-1)  # last dim: 768 + 768 = 1536

Downstream Task (ELMo-style)

# For a downstream task, use task-specific scalar mix:
# elmo_repr = gamma * sum(s_j * h_j for each layer j)
# Scalar weights s_j are learned during fine-tuning

🔬 Research Context

This model is part of a research project exploring xLSTM-based language models for Indonesian NLP benchmarks, particularly:

  • IndoNLU tasks (text classification, NER, sentiment analysis)
  • ABSA (Aspect-Based Sentiment Analysis) on Indonesian product reviews

The ELMo-style training allows the model to serve as a feature extractor, producing contextual embeddings that can be used by downstream task-specific models.


📖 Citation

If you use this model, please cite the original xLSTM paper:

@article{beck2024xlstm,
  title={xLSTM: Extended Long Short-Term Memory},
  author={Beck, Maximilian and Pöppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, Günter and Brandstetter, Johannes and Hochreiter, Sepp},
  journal={arXiv preprint arXiv:2405.04517},
  year={2024}
}

And the original ELMo paper:

@inproceedings{peters2018deep,
  title={Deep contextualized word representations},
  author={Peters, Matthew E and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  booktitle={Proceedings of NAACL-HLT 2018},
  year={2018}
}

👤 Author

Fakhri (NLP Research)
Pretraining code: BiXLSTM_Pretraining_Fakhri.ipynb


Pretrained on Indonesian Wikipedia · Bidirectional xLSTM · ELMo-style objective