LiLT Base - GGUF
GGUF conversion of SCUT-DLVCLab/lilt-roberta-en-base for use with CrispEmbed.
LiLT (Language-independent Layout Transformer) is a dual-stream encoder that combines RoBERTa (768d text) with a parallel layout transformer (192d) via BiACM (bidirectional attention complementation). This is the base model (pre-trained, no task-specific head); use it as a starting point for fine-tuning on your own document understanding tasks.
For a ready-to-use model fine-tuned on form understanding, see cstr/lilt-funsd-GGUF.
Model Details
| Property | Value |
|---|---|
| Architecture | LiLT (RoBERTa + Layout Transformer + BiACM) |
| Parameters | 130.7M |
| Hidden size | 768 (text) / 192 (layout) |
| Layers | 12 |
| Heads | 12 |
| Vocab | 50,265 (RoBERTa BPE) |
| License | MIT |
Available Formats
| Format | Size |
|---|---|
| Float32 | 498 MB |
| Q8_0 | 134 MB |
| Q4_K | 90 MB |
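As a rough sanity check on the listed sizes, the sketch below estimates file size from the parameter count and the nominal bits per weight of each format. It assumes the sizes are reported in MiB and that every tensor uses the same type; Q8_0 stores 32 weights in 34 bytes (8.5 bits per weight) and Q4_K stores 256 weights in 144 bytes (4.5 bits per weight). Real GGUF files usually keep some tensors at higher precision and add metadata, so the quantized estimates should be read as lower bounds rather than exact predictions.

```python
PARAMS = 130.7e6  # parameter count from the Model Details table
MIB = 2 ** 20

# Nominal storage cost per weight, assuming one type for every tensor.
BITS_PER_WEIGHT = {
    "Float32": 32.0,
    "Q8_0": 8.5,   # 32 int8 weights + one fp16 scale per block
    "Q4_K": 4.5,   # 256 weights packed into a 144-byte super-block
}


def estimate_mib(params: float, bits_per_weight: float) -> float:
    """Estimated tensor payload in MiB, ignoring metadata and mixed types."""
    return params * bits_per_weight / 8 / MIB


estimates = {name: estimate_mib(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}

for name, size in estimates.items():
    print(f"{name}: ~{size:.1f} MiB")
```

The Float32 estimate lands within about 1 MiB of the listed 498 MB. The quantized files are larger than their estimates, which is consistent with some tensors not being quantized, although the source does not say which ones.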
Architecture
LiLT's key innovation is BiACM (Bidirectional Attention Complementation):
- Text and layout streams each compute separate Q/K/V projections
- Attention scores from both streams are summed before softmax
- Each stream applies the combined attention to its own values
- Separate FFN layers process each stream independently
This allows layout information to guide text attention patterns (and vice versa) without requiring pixel-level image features.
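The steps above can be sketched for a single attention head in NumPy. This is an illustration under stated assumptions, not the CrispEmbed implementation: the function name `biacm_head`, the random weights, and the per-stream `1/sqrt(d)` scaling are all assumptions (the scaling follows standard scaled dot-product attention), and the head sizes 64 and 16 come from dividing 768 and 192 by 12 heads.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def biacm_head(text_h, layout_h, w_text, w_layout):
    """One attention head with shared (summed) attention scores.

    text_h:   (seq, d_text) hidden states of the text stream
    layout_h: (seq, d_layout) hidden states of the layout stream
    w_text / w_layout: dicts of 'q', 'k', 'v' projection matrices
    """
    # Each stream computes its own Q/K/V projections.
    q_t, k_t, v_t = (text_h @ w_text[name] for name in ("q", "k", "v"))
    q_l, k_l, v_l = (layout_h @ w_layout[name] for name in ("q", "k", "v"))

    # Per-stream scores, scaled by each stream's head size (assumption).
    scores_t = q_t @ k_t.T / np.sqrt(q_t.shape[-1])
    scores_l = q_l @ k_l.T / np.sqrt(q_l.shape[-1])

    # Scores from both streams are summed before the softmax.
    probs = softmax(scores_t + scores_l)

    # Each stream applies the combined attention to its own values.
    return probs @ v_t, probs @ v_l, probs


rng = np.random.default_rng(0)
seq_len, d_text, d_layout = 5, 768, 192
head_t, head_l = d_text // 12, d_layout // 12  # 64 and 16

w_text = {n: rng.normal(scale=0.02, size=(d_text, head_t)) for n in ("q", "k", "v")}
w_layout = {n: rng.normal(scale=0.02, size=(d_layout, head_l)) for n in ("q", "k", "v")}

text_out, layout_out, probs = biacm_head(
    rng.normal(size=(seq_len, d_text)),
    rng.normal(size=(seq_len, d_layout)),
    w_text,
    w_layout,
)
print(text_out.shape, layout_out.shape)
```

Because both streams share one attention distribution, the sequence lengths must match, while the value dimensions are free to differ. The separate per-stream FFN layers are omitted here.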
Layout Embeddings
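Each token's bounding box [x0, y0, x1, y1] is encoded with six lookups into four learned position-embedding tables (x, y, h, w): x0 and x1 use the x table, y0 and y1 use the y table, and the box height and width use the h and w tables. The six vectors (768 / 6 = 128d each) are concatenated to 768d, projected to 192d, and combined with sequential position embeddings.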
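A minimal NumPy sketch of this embedding path follows. The table names, the random initialisation, the 0-1023 integer coordinate range, the 514-entry position table, and the use of addition to combine with the sequential position embedding are assumptions for illustration; only the dimensions (768d concatenation, 192d output) come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

MAX_COORD = 1024      # assumption: boxes are normalised to integers in 0..1023
MAX_POSITIONS = 514   # assumption: size of the sequential position table
SUB_DIM = 768 // 6    # 128d per lookup, so six lookups concatenate to 768d
LAYOUT_DIM = 192

# Four learned tables (random stand-ins here): x, y, height, width.
x_table = rng.normal(size=(MAX_COORD, SUB_DIM))
y_table = rng.normal(size=(MAX_COORD, SUB_DIM))
h_table = rng.normal(size=(MAX_COORD, SUB_DIM))
w_table = rng.normal(size=(MAX_COORD, SUB_DIM))

projection = rng.normal(scale=0.02, size=(6 * SUB_DIM, LAYOUT_DIM))
position_table = rng.normal(size=(MAX_POSITIONS, LAYOUT_DIM))


def layout_embed(bbox, position_ids):
    """bbox: (seq, 4) integer array of [x0, y0, x1, y1]; returns (seq, 192)."""
    x0, y0, x1, y1 = bbox[:, 0], bbox[:, 1], bbox[:, 2], bbox[:, 3]
    parts = [
        x_table[x0], y_table[y0],   # top-left corner
        x_table[x1], y_table[y1],   # bottom-right corner
        h_table[y1 - y0],           # box height
        w_table[x1 - x0],           # box width
    ]
    concatenated = np.concatenate(parts, axis=-1)   # (seq, 768)
    projected = concatenated @ projection           # (seq, 192)
    # Combined with sequential position embeddings (addition is an assumption).
    return projected + position_table[position_ids]


boxes = np.array([[10, 20, 110, 40], [120, 20, 300, 40], [0, 0, 0, 0]])
layout_vectors = layout_embed(boxes, np.arange(len(boxes)))
print(layout_vectors.shape)
```

Boxes must satisfy x1 >= x0 and y1 >= y0 so that the height and width indices stay non-negative.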
Parity
Verified against HuggingFace transformers:
- 25/25 encoder stages: cos_min = 1.000000
- max_abs < 1.6e-03 across all layers
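The two metrics can be computed per stage as in the sketch below. How the reported numbers were actually produced is not stated, so this assumes `cos_min` is the minimum per-token cosine similarity and `max_abs` is the largest element-wise absolute difference between the reference and GGUF outputs; `parity_metrics` is a hypothetical helper name.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Compare two (seq, hidden) activations from the same encoder stage.

    Returns (cos_min, max_abs): the lowest per-token cosine similarity and
    the largest absolute element-wise difference.
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)

    dots = (ref * cand).sum(axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(cand, axis=-1)
    cosines = dots / np.maximum(norms, 1e-12)  # guard against zero vectors

    return float(cosines.min()), float(np.abs(ref - cand).max())


rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 768))
cand = ref.copy()
cand[0, 0] += 1e-3  # simulate a small numerical deviation

cos_min, max_abs = parity_metrics(ref, cand)
print(f"cos_min = {cos_min:.6f}, max_abs = {max_abs:.1e}")
```

Reporting both values is useful because cosine similarity is insensitive to small absolute errors in large-magnitude activations, while `max_abs` catches them.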
Citation