TELEN: Temporal Evolving Legal Embedding Network
Vietnamese legal text embedding with meta-learning for continuous adaptation to new laws.
Overview
TELEN introduces a novel embedding architecture designed specifically for Vietnamese legal text retrieval in RAG (Retrieval-Augmented Generation) systems. Unlike conventional static embedding models, TELEN generates embeddings that adapt dynamically to the current state of the legal corpus — enabling seamless integration of new laws without retraining.
Key Innovations
HyperNetwork-Driven Projection — Instead of fixed projection weights, a HyperNetwork generates the embedding projection function from the current legal corpus state. When new laws are published, the embedding space adapts automatically.
Legal Concept Graph (LCG) — An evolving knowledge graph where nodes represent legal entities (laws, key terms) and edges encode cross-references, agency hierarchy, temporal sequences, and semantic similarity.
State-Adaptive Embeddings — Embeddings are not static vectors but are modulated by a learned "legal state vector" that summarizes the entire legal landscape at any point in time.
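The Legal Concept Graph can be pictured as a graph with typed edges. The snippet below is a minimal sketch of that idea, not the contents of `concept_graph.py`: the class name `ConceptGraph`, the edge-type labels, and the node identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Edge types mirror the four relations listed above; the string labels are illustrative.
EDGE_TYPES = {"cross_reference", "agency_hierarchy", "temporal", "semantic"}


@dataclass
class ConceptGraph:
    """Toy typed graph: nodes are law IDs or key terms, edges carry a relation type."""
    nodes: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_edge(self, src: str, dst: str, edge_type: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.nodes.update((src, dst))
        self.edges[src].add((dst, edge_type))

    def neighbors(self, node: str, edge_type: Optional[str] = None) -> list:
        """Return sorted neighbor IDs, optionally filtered by relation type."""
        return sorted(
            dst for dst, etype in self.edges[node]
            if edge_type is None or etype == edge_type
        )


graph = ConceptGraph()
graph.add_edge("law:A", "law:B", "cross_reference")  # A cites B
graph.add_edge("law:A", "law:C", "temporal")         # C follows A in time
graph.add_edge("law:A", "term:thuế", "semantic")     # A concerns "thuế" (tax)
print(graph.neighbors("law:A", "cross_reference"))
```

A GNN would then pass messages along these typed edges; that part is omitted here.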
Architecture
Legal Text
↓
Bi-Encoder (bkai-foundation-models/vietnamese-bi-encoder)
↓
Raw Representation [768-dim]
↓
┌─────────────────────────────────────┐
│ HyperNetwork(state_vector) → ΔW, Δb │ ← Generated, not learned!
│ Adapted Projection = Base + ΔW·x + Δb │
└─────────────────────────────────────┘
↓
Legal Concept Graph (GNN)
↓ state_vector
State Encoder ← current legal corpus
↓
L2-Normalized Embedding [768-dim]
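To make the projection step in the diagram concrete, here is a small NumPy sketch of a state-conditioned projection. It is not the code in `hypernetwork.py`. It assumes the adaptation is low-rank (ΔW = U·V, which is one plausible reading of the "adaptation rank" hyperparameter), that the hypernetwork is a single linear map, and that weights are random; `STATE_DIM` and all variable names are hypothetical, and the sizes are shrunk for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for readability; the README's values are 768 dims and rank 64.
DIM, RANK, STATE_DIM = 16, 4, 8  # STATE_DIM is an assumed size

W_base = rng.normal(scale=0.1, size=(DIM, DIM))            # fixed base projection
H_u = rng.normal(scale=0.1, size=(STATE_DIM, DIM * RANK))  # toy hypernetwork weights
H_v = rng.normal(scale=0.1, size=(STATE_DIM, RANK * DIM))
H_b = rng.normal(scale=0.1, size=(STATE_DIM, DIM))


def generate_delta(state):
    """Map a state vector to (delta_W, delta_b); delta_W = U @ V has rank <= RANK."""
    U = (state @ H_u).reshape(DIM, RANK)
    V = (state @ H_v).reshape(RANK, DIM)
    return U @ V, state @ H_b


def adapted_embedding(x, state):
    """Apply the base plus generated projection, then L2-normalize."""
    delta_W, delta_b = generate_delta(state)
    y = W_base @ x + delta_W @ x + delta_b
    return y / np.linalg.norm(y)


x = rng.normal(size=DIM)                # stand-in for the bi-encoder output
state_old = rng.normal(size=STATE_DIM)  # state before a corpus change
state_new = state_old + 0.5 * rng.normal(size=STATE_DIM)

e_old = adapted_embedding(x, state_old)
e_new = adapted_embedding(x, state_new)
print(float(e_old @ e_new))  # cosine between old and new embedding of the same text
```

The point of the sketch is that the same input text maps to a different unit vector once the state vector changes, without any weight update to the backbone.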
Benchmark Results
Test set: 1,406 Vietnamese legal articles from 2021 (held-out, unseen during training)
| Model | NDCG@3 | NDCG@5 | NDCG@10 | MRR@3 | MRR@5 | MRR@10 |
|---|---|---|---|---|---|---|
| BM25 (lexical) | 0.6753 | 0.7173 | 0.7250 | 0.6683 | 0.6928 | 0.6990 |
| PhoBERT-base-v2 (monolingual dense) | 0.5866 | 0.6360 | 0.6505 | 0.5657 | 0.5970 | 0.6059 |
| multilingual-E5-base (multilingual dense) | 0.4675 | 0.4888 | 0.5157 | 0.4327 | 0.4452 | 0.4573 |
| BAAI/bge-m3 (multilingual dense, 1024d) | 0.4668 | 0.5129 | 0.5452 | 0.4407 | 0.4657 | 0.4802 |
| DEk21 (legal dense) | 0.7900 | 0.8127 | 0.8344 | 0.7660 | 0.7785 | 0.7865 |
| TELEN (adaptive dense) | 0.9036 | 0.9138 | 0.9132 | 0.8830 | 0.8878 | 0.8878 |
| TELEN + CE re-rank (adaptive dense) | 0.9346 | 0.9339 | 0.9238 | 0.9199 | 0.9223 | 0.9223 |
Key insight: Multilingual SOTA models (multilingual-E5, BGE-M3) score below even BM25 on Vietnamese legal text, confirming that domain and language specialization trumps generic multilingual pre-training for legal retrieval.
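For readers unfamiliar with the metrics in the table, the following sketch computes MRR@k and NDCG@k for a single query. It assumes binary relevance (a document is either relevant or not); the exact variant used in `evaluate.py` is not stated here, and the reported scores would be averages over all test queries. The document IDs are made up.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d7", "d2", "d9", "d4"]  # system output, best first
relevant = {"d2"}                   # ground-truth relevant documents
print(mrr_at_k(ranking, relevant, k=3))   # 0.5 (first hit at rank 2)
print(ndcg_at_k(ranking, relevant, k=3))  # 1/log2(3), about 0.6309
```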
Relative Improvement
| Comparison | NDCG@3 | NDCG@5 | NDCG@10 | MRR@10 |
|---|---|---|---|---|
| TELEN vs multilingual-E5 | +93.3% | +86.9% | +77.1% | +94.1% |
| TELEN vs BGE-M3 | +93.6% | +78.2% | +67.5% | +84.9% |
| TELEN + CE re-rank vs PhoBERT | +59.3% | +46.8% | +42.0% | +52.2% |
| TELEN + CE re-rank vs DEk21 | +18.3% | +14.9% | +10.7% | +17.3% |
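The percentages above follow from the benchmark table as (new − baseline) / baseline. Recomputing them shows that the multilingual-E5 and BGE-M3 rows are measured from the plain TELEN scores, while the PhoBERT and DEk21 rows match the TELEN + CE re-rank scores. A short check using the NDCG@3 column (the function name is illustrative):

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# NDCG@3 values copied from the benchmark table.
telen, telen_ce = 0.9036, 0.9346
me5, dek21 = 0.4675, 0.7900

print(f"{relative_improvement(telen, me5):+.1f}%")       # +93.3%
print(f"{relative_improvement(telen_ce, dek21):+.1f}%")  # +18.3%
```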
Quick Start
Installation
pip install -r requirements.txt
Inference
from inference import TELENInference
# Load model
model = TELENInference()
# Encode legal texts
texts = [
"Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng...",
"Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh...",
]
embeddings = model.encode(texts) # → [2, 768] normalized vectors
# Compute similarity
similarity = model.similarity(texts[0], texts[1])
print(f"Cosine similarity: {similarity:.4f}")
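# Retrieve similar documents
# (assumption: `corpus` is a list of article strings; `texts` is reused here as a stand-in)
corpus = texts
results = model.retrieve(texts[0], corpus, top_k=10)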
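Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, and top-k retrieval is a sort over those scores. The sketch below illustrates that idea with NumPy; it is not the implementation behind `model.retrieve`, and the function name `retrieve_top_k` and the 2-dimensional toy vectors are hypothetical.

```python
import numpy as np


def retrieve_top_k(query_vec, corpus_vecs, top_k=10):
    """Rank corpus rows by cosine similarity to the query.

    Both inputs are assumed L2-normalized, so the dot product equals cosine similarity.
    """
    scores = corpus_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]  # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]


corpus_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])  # unit-length rows
query_vec = np.array([0.8, 0.6])                              # unit-length query
hits = retrieve_top_k(query_vec, corpus_vecs, top_k=2)
print(hits)  # row 1 ranks first, then row 0
```

For a corpus of tens of thousands of articles a brute-force matrix product like this is usually fast enough; an approximate nearest-neighbor index becomes worthwhile only at much larger scale.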
Training
# Train TELEN from scratch
python train.py
# Train cross-encoder re-ranker (optional; raises MRR@10 from 0.8878 to 0.9223, about 4% relative)
python train_ce.py
Evaluation
# Full benchmark (TELEN vs BM25/PhoBERT/mE5/BGE-M3/DEk21)
python eval.py
# TELEN + Cross-encoder re-ranking (MRR-optimized)
python eval_rerank.py
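Re-ranking is a two-stage scheme: the embedding model proposes candidates, then a slower pairwise scorer re-orders them. The sketch below shows only that control flow. The `token_overlap` scorer is a deliberately trivial stand-in for a trained cross-encoder, and both function names are hypothetical rather than taken from `eval_rerank.py`.

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Re-order first-stage candidates by a pairwise relevance score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query, doc):
    """Stand-in scorer: fraction of query tokens that also appear in the document."""
    q_tokens = set(query.split())
    return len(q_tokens & set(doc.split())) / max(len(q_tokens), 1)


query = "thuế giá_trị gia_tăng"
candidates = [  # pretend these came from the first-stage retriever
    "quy_định về lao_động",
    "quản_lý thuế giá_trị gia_tăng",
    "thuế thu_nhập",
]
ranked = rerank(query, candidates, token_overlap, top_k=2)
print(ranked[0][0])
```

In a real pipeline `score_fn` would wrap the cross-encoder's forward pass on the (query, document) pair.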
Training Details
Dataset
- Source: another-symato/VMTEB-Zalo-legel-retrieval-wseg on HuggingFace
- Content: 61,425 Vietnamese legal articles (Thông tư, Nghị định, Luật, Pháp lệnh)
- Period: 1999–2021
- Format: Word-segmented Vietnamese text (underscore-separated compound words)
Training Pipeline
| Stage | Description | Epochs | Trainable Params |
|---|---|---|---|
| 1. Contrastive Pretraining | Triplet + InfoNCE loss on same-law article pairs | 5 | ~1M (projection head) |
| 2. Meta-Training | HyperNetwork learns to adapt embedding space for future laws | 50 (early stop) | ~4M (HyperNetwork + State Encoder) |
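The two Stage 1 losses can be written out for a single query as follows. This is a plain-Python sketch for intuition, not the training code: the triplet `margin` of 0.2 is an assumed value, and how the two losses are weighted against each other is not specified here.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE for one query: negative log-softmax of the positive over all candidates.

    sim_pos and sim_negs are cosine similarities in [-1, 1].
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denom - logits[0]


def triplet_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge on the similarity gap; zero once the positive leads by `margin`."""
    return max(0.0, margin - (sim_pos - sim_neg))


print(info_nce(0.9, [0.2, 0.1]))  # positive clearly separated: loss near zero
print(info_nce(0.3, [0.2, 0.1]))  # positive barely ahead: larger loss
print(triplet_loss(0.9, 0.2))     # gap exceeds the margin: 0.0
```

A low temperature such as 0.05 scales similarity gaps by a factor of 20 before the softmax, so even small differences between positive and negative produce sharp gradients.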
Hyperparameters
| Parameter | Value |
|---|---|
| Backbone | bkai-foundation-models/vietnamese-bi-encoder |
| Embedding dimension | 768 |
| Adaptation rank | 64 |
| GNN layers | 3 |
| Meta N-way, K-shot | 16-way, 5-shot |
| Negatives per query | 256 (50% hard + 50% random) |
| Temperature | 0.05 |
| Optimizer | AdamW + CosineAnnealingWarmRestarts |
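The "50% hard + 50% random" negative mix can be sketched as below. How hard negatives are mined is not described here, so `hard_pool` is simply assumed to be a precomputed list of difficult candidates; the function name and parameters are hypothetical.

```python
import random


def sample_negatives(hard_pool, random_pool, n_total=256, hard_fraction=0.5, seed=None):
    """Draw n_total negatives: a fixed share from hard candidates, the rest at random."""
    rng = random.Random(seed)
    n_hard = min(int(n_total * hard_fraction), len(hard_pool))
    hard = rng.sample(hard_pool, n_hard)
    chosen = set(hard)
    # Exclude already-chosen hard negatives so no document appears twice.
    remaining = [doc for doc in random_pool if doc not in chosen]
    n_rand = min(n_total - n_hard, len(remaining))
    return hard + rng.sample(remaining, n_rand)


hard_pool = list(range(200))     # IDs of documents judged "hard" for this query
random_pool = list(range(1000))  # IDs of the whole corpus
negatives = sample_negatives(hard_pool, random_pool, seed=0)
print(len(negatives), len(set(negatives)))
```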
Hardware
- GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
- Training time: ~8 hours (5 contrastive + 50 meta epochs)
Continuous Adaptation
When a new law is published, TELEN adapts without retraining:
# New law arrives
new_articles = [
"Điều 1: Luật mới về trí tuệ nhân tạo...",
"Điều 2: Các nguyên tắc áp dụng AI trong xét xử...",
]
# Update concept graph (milliseconds)
model.add_new_law("123/2025/l-ai", new_articles)
# Embedding space automatically adapts via HyperNetwork
# All subsequent query embeddings reflect the new legal landscape
embeddings = model.encode(["Điều 1: ..."])
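One way to see why adding a law can be cheap is to treat the state vector as an incrementally updated summary. The sketch below uses a running mean of law vectors as a stand-in; the actual State Encoder works over the concept graph, so this is only an analogy, and `RunningState` is a hypothetical name.

```python
class RunningState:
    """Incrementally maintained mean of law vectors (a toy 'legal state vector')."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def add_law(self, law_vec):
        """Fold one new law into the state in O(dim), with no gradient step."""
        self.count += 1
        self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, law_vec)]
        return self.mean


state = RunningState(dim=3)
state.add_law([1.0, 0.0, 0.0])
state.add_law([0.0, 1.0, 0.0])
print(state.mean)  # [0.5, 0.5, 0.0]
```

Since the projection is generated from the state, any change to the state shifts subsequent embeddings without touching the trained weights.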
Project Structure
law-embedding/
├── dataset/
│   └── train-00000-of-00001.parquet   # Training data (61K legal articles)
├── src/
│   ├── data.py                        # Data loading utilities
│   └── telern/
│       ├── config.py                  # Configuration
│       ├── model.py                   # TELEN architecture
│       ├── concept_graph.py           # Legal Concept Graph + GNN
│       ├── hypernetwork.py            # HyperNetwork + StateEncoder
│       └── evaluate.py                # Evaluation metrics & baselines
├── data/checkpoints/telen/
│   └── telen_best.pt                  # Pretrained model weights
├── train.py                           # Training script
├── train_ce.py                        # Cross-encoder training (optional)
├── eval.py                            # Evaluation script
├── inference.py                       # Inference API
├── requirements.txt
└── README.md
Citation
@misc{telen2025,
  title={TELEN: Temporal Evolving Legal Embedding Network for Vietnamese Law},
  author={dangdinh},
  year={2026},
  publisher={Hugging Face}
}
License
MIT License — see LICENSE file for details.
Acknowledgments
- bkai-foundation-models/vietnamese-bi-encoder: backbone bi-encoder
- huyydangg/DEk21_hcmute_embedding: baseline comparison
- vinai/phobert-base-v2: used in cross-encoder re-ranker