LiLT Base — GGUF

GGUF conversion of SCUT-DLVCLab/lilt-roberta-en-base for use with CrispEmbed.

LiLT (Language-independent Layout Transformer) is a dual-stream encoder that couples a RoBERTa text stream (768d) with a parallel layout transformer (192d) through BiACM, the bidirectional attention complementation mechanism. This is the base model: pre-trained, with no task-specific head. Use it as a starting point for fine-tuning on your own document understanding tasks.

For a ready-to-use model fine-tuned on form understanding, see cstr/lilt-funsd-GGUF.

Model Details

Property	Value
Architecture	LiLT (RoBERTa + Layout Transformer + BiACM)
Parameters	130.7M
Hidden size	768 (text) / 192 (layout)
Layers	12
Heads	12
Vocab	50,265 (RoBERTa BPE)
License	MIT

Available Formats

File	Format	Size
	Float32	498 MB
	Q8_0	134 MB
	Q4_K	90 MB
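
As a rough sanity check on the sizes above, the nominal file size can be estimated from the parameter count and the bytes per weight of each format. This is a back-of-the-envelope sketch, not a description of how the files were produced. It assumes that every tensor is stored in the nominal format, that metadata overhead is negligible, and that the listed sizes are binary megabytes (MiB); none of these is stated in the card.

```python
# Rough size estimate from parameter count and bytes per weight.
# Assumptions (not from the model card): every tensor uses the nominal
# format, metadata overhead is ignored, and sizes are reported in MiB.

N_PARAMS = 130.7e6  # parameter count from the Model Details table

# Standard GGML block layouts: bytes per block / weights per block.
BYTES_PER_WEIGHT = {
    "Float32": 4.0,
    "Q8_0": 34 / 32,    # 32 int8 weights plus one fp16 scale per block
    "Q4_K": 144 / 256,  # 256-weight super-block, i.e. 4.5 bits per weight
}


def estimate_mib(n_params: float, bytes_per_weight: float) -> float:
    """Return the estimated tensor payload in MiB."""
    return n_params * bytes_per_weight / 2**20


estimates = {
    name: estimate_mib(N_PARAMS, bpw) for name, bpw in BYTES_PER_WEIGHT.items()
}

for name, mib in estimates.items():
    print(f"{name:8s} ~{mib:6.1f} MiB")
```

Under these assumptions the Float32 estimate lands close to the listed 498 MB. The quantized files are somewhat larger than their nominal estimates; one plausible explanation is that some tensors (for example embeddings or normalization weights) are kept at higher precision, but the card does not say.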

Architecture

LiLT's key innovation is BiACM (the Bidirectional Attention Complementation Mechanism), which works as follows:

Text and layout streams each compute separate Q/K/V projections
Attention scores from both streams are summed before softmax
Each stream applies the combined attention to its own values
Separate FFN layers process each stream independently

This allows layout information to guide text attention patterns (and vice versa) without requiring pixel-level image features.
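
The four steps above can be sketched for a single attention head in plain Python. This is a minimal illustration under stated assumptions, not the CrispEmbed or Hugging Face implementation: the helper names (`biacm_head`, `scaled_scores`, and so on) are hypothetical, weights are random, biases, multi-head splitting, and the per-stream FFN are omitted, and each stream's scores are assumed to be scaled by the square root of its own head width before summing.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def scaled_scores(x, w_q, w_k):
    """Per-stream attention scores Q K^T / sqrt(d_head)."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(w_q[0]))  # assumed: each stream uses its own width
    return [[s / scale for s in row] for row in matmul(q, transpose(k))]


def biacm_head(text, layout, wt, wl):
    """One head: separate Q/K/V per stream, summed scores, shared softmax."""
    text_scores = scaled_scores(text, wt["q"], wt["k"])
    layout_scores = scaled_scores(layout, wl["q"], wl["k"])
    combined = [
        [a + b for a, b in zip(r1, r2)]
        for r1, r2 in zip(text_scores, layout_scores)
    ]
    probs = softmax_rows(combined)
    # Each stream applies the combined attention to its own values.
    text_out = matmul(probs, matmul(text, wt["v"]))
    layout_out = matmul(probs, matmul(layout, wl["v"]))
    return text_out, layout_out, probs


# Toy sizes that keep the 4:1 text-to-layout width ratio (768d vs 192d).
rng = random.Random(0)
seq_len, d_text, d_layout = 3, 8, 2
text = rand_matrix(seq_len, d_text, rng)
layout = rand_matrix(seq_len, d_layout, rng)
wt = {name: rand_matrix(d_text, d_text, rng) for name in "qkv"}
wl = {name: rand_matrix(d_layout, d_layout, rng) for name in "qkv"}

text_out, layout_out, probs = biacm_head(text, layout, wt, wl)
print(len(text_out), len(text_out[0]), len(layout_out[0]))
```

Because both streams share the same summed scores, the attention weights are identical for text and layout at inference time; only the value projections and output widths differ.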

Layout Embeddings

Each token's bounding box [x0, y0, x1, y1] is encoded with six learned position-embedding lookups (x0, y0, x1, y1, plus the box height h and width w). The six vectors are concatenated to 768d, projected to 192d, and combined with sequential position embeddings.
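
The shape bookkeeping can be made concrete with a small sketch. It assumes the layout follows the Hugging Face LiLT implementation: four embedding tables (x, y, h, w) of 768 / 6 = 128d each, six lookups concatenated in the order x0, y0, x1, y1, h, w, and the sequential position embedding added after the projection. The table sizes (`MAX_COORD`, `MAX_POS`), the random weights, and the function name `embed_box` are illustrative assumptions; the projection bias and the final normalization are omitted.

```python
import random

HIDDEN = 768          # text hidden size, also the width of the concatenation
LAYOUT_HIDDEN = 192   # layout hidden size
PART = HIDDEN // 6    # 128d per lookup, assuming six equal-width parts
MAX_COORD = 1024      # assumed number of rows in each coordinate table
MAX_POS = 512         # assumed number of sequential positions

rng = random.Random(0)


def table(rows, cols):
    """Random stand-in for a learned weight table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


x_emb, y_emb, h_emb, w_emb = (table(MAX_COORD, PART) for _ in range(4))
proj_cols = table(LAYOUT_HIDDEN, HIDDEN)  # one 768d weight column per output
pos_emb = table(MAX_POS, LAYOUT_HIDDEN)


def embed_box(box, position):
    """Map one bounding box and its sequence position to a 192d vector."""
    x0, y0, x1, y1 = box
    parts = [
        x_emb[x0], y_emb[y0], x_emb[x1], y_emb[y1],
        h_emb[y1 - y0], w_emb[x1 - x0],
    ]
    concat = [v for part in parts for v in part]  # 6 * 128 = 768 values
    projected = [sum(c * w for c, w in zip(concat, col)) for col in proj_cols]
    # Combine with the sequential position embedding (assumed to be a sum).
    return [p + q for p, q in zip(projected, pos_emb[position])]


# Example box with coordinates assumed to be normalized to a 0..1000 grid.
vec = embed_box([120, 40, 310, 72], position=5)
print(len(vec))
```

Note that the x table serves both x0 and x1 and the y table serves both y0 and y1 in this sketch, which is how six lookups come from four tables.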

Parity

Verified against HuggingFace transformers:

25/25 encoder stages: cos_min = 1.000000
max_abs < 1.6e-03 across all layers
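
For readers who want to reproduce this kind of check, the two metrics can be computed as follows. This is a sketch of one reasonable reading of the numbers above: `cos_min` is taken to be the minimum cosine similarity over all compared stages and `max_abs` the largest element-wise absolute difference. The function names and toy data are hypothetical; real inputs would be flattened activations dumped from both implementations.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def parity_report(reference_stages, candidate_stages):
    """Return (cos_min, max_abs) across all paired stages."""
    pairs = list(zip(reference_stages, candidate_stages))
    cos_min = min(cosine_similarity(r, c) for r, c in pairs)
    max_abs = max(max_abs_diff(r, c) for r, c in pairs)
    return cos_min, max_abs


# Toy data: two "stages" whose candidate outputs deviate by 1e-3.
reference = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.25]]
candidate = [[v + 1e-3 for v in stage] for stage in reference]

cos_min, max_abs = parity_report(reference, candidate)
print(f"cos_min = {cos_min:.6f}, max_abs = {max_abs:.1e}")
```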

Citation
