LiLT Base - GGUF

GGUF conversion of SCUT-DLVCLab/lilt-roberta-en-base for use with CrispEmbed.

LiLT (Language-independent Layout Transformer) is a dual-stream encoder that combines RoBERTa (768d text) with a parallel layout transformer (192d) via BiACM (bidirectional attention complementation). This is the base model (pre-trained, no task-specific head); use it as a starting point for fine-tuning on your own document understanding tasks.

For a ready-to-use model fine-tuned on form understanding, see cstr/lilt-funsd-GGUF.

Model Details

Property      Value
Architecture  LiLT (RoBERTa + Layout Transformer + BiACM)
Parameters    130.7M
Hidden size   768 (text) / 192 (layout)
Layers        12
Heads         12
Vocab         50,265 (RoBERTa BPE)
License       MIT

Available Formats

Format   Size
Float32  498 MB
Q8_0     134 MB
Q4_K     90 MB
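
As a rough sanity check on the sizes above, the sketch below estimates an idealized file size from the parameter count and the storage cost per block of each format. It assumes the standard ggml block layouts (34 bytes per 32 weights for Q8_0, 144 bytes per 256 weights for Q4_K) and that every tensor is stored in the named format; the name `estimate_mib` is illustrative and not part of any tool. The Float32 estimate comes out at about 498.6 MiB, which suggests the listed sizes are in MiB. The quantized files are larger than the idealized estimates, plausibly because some tensors are kept at higher precision, though this card does not say so.

```python
# (bytes per block, weights per block) for each storage format.
BYTES_PER_BLOCK = {
    "F32": (4, 1),       # 4 bytes per weight
    "Q8_0": (34, 32),    # 32 int8 values plus one fp16 scale
    "Q4_K": (144, 256),  # 256 4-bit values plus scales per super-block
}


def estimate_mib(n_params, fmt):
    """Idealized size in MiB if every weight were stored in `fmt`."""
    block_bytes, block_weights = BYTES_PER_BLOCK[fmt]
    return n_params * block_bytes / block_weights / 2**20


N_PARAMS = 130.7e6  # parameter count from the Model Details table

for fmt in BYTES_PER_BLOCK:
    print(f"{fmt}: {estimate_mib(N_PARAMS, fmt):.1f} MiB")
```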

Architecture

LiLT's key innovation is BiACM (Bidirectional Attention Complementation):

  1. Text and layout streams each compute separate Q/K/V projections
  2. Attention scores from both streams are summed before softmax
  3. Each stream applies the combined attention to its own values
  4. Separate FFN layers process each stream independently

This allows layout information to guide text attention patterns (and vice versa) without requiring pixel-level image features.
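
The four steps above can be sketched for a single attention head in plain Python. This is a toy illustration, not the CrispEmbed or HuggingFace implementation: it assumes no attention mask, uses 8d and 2d streams in place of 768d and 192d, scales each stream's scores by the square root of its own width, and omits the per-stream FFN layers of step 4. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scores(x, wq, wk):
    """Step 1: one stream's own Q/K projections and scaled dot-product scores."""
    q, k = matmul(x, wq), matmul(x, wk)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(wq[0]))
    return [[v / scale for v in row] for row in matmul(q, k_t)]


def biacm_attention(text, layout, w_text, w_layout):
    s_text = scores(text, w_text["q"], w_text["k"])
    s_layout = scores(layout, w_layout["q"], w_layout["k"])
    # Step 2: sum the scores of both streams before the softmax.
    combined = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(s_text, s_layout)]
    probs = softmax_rows(combined)
    # Step 3: each stream applies the shared attention to its own values.
    text_out = matmul(probs, matmul(text, w_text["v"]))
    layout_out = matmul(probs, matmul(layout, w_layout["v"]))
    return text_out, layout_out, probs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
TOKENS, D_TEXT, D_LAYOUT = 4, 8, 2  # toy sizes, same 4:1 ratio as 768:192
text = random_matrix(TOKENS, D_TEXT, rng)
layout = random_matrix(TOKENS, D_LAYOUT, rng)
w_text = {name: random_matrix(D_TEXT, D_TEXT, rng) for name in "qkv"}
w_layout = {name: random_matrix(D_LAYOUT, D_LAYOUT, rng) for name in "qkv"}

text_out, layout_out, probs = biacm_attention(text, layout, w_text, w_layout)
```

Both streams share one attention distribution per token, while each keeps its own width: `text_out` has 8 columns and `layout_out` has 2.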

Layout Embeddings

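Each token's bounding box [x0, y0, x1, y1] is encoded via 6 learned position embeddings, one each for x0, y0, x1, y1, height (h), and width (w), drawn from the x, y, h, and w embedding tables. The 6 vectors (128d each) are concatenated to 768d, projected to 192d, and combined with sequential position embeddings.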
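
The lookup-concatenate-project pipeline can be sketched as follows. This is a minimal illustration with random weights and toy sizes (2d per lookup, 12d concatenated, 3d projected, in place of 128d, 768d, and 192d). It assumes integer coordinates in the range 0 to 1000, that the x table is shared by x0 and x1 and the y table by y0 and y1, and that the sequential position embedding is added to the projection. The concatenation order and all names are hypothetical.

```python
import random

rng = random.Random(0)

MAX_COORD = 1001   # assumed: integer coordinates in 0..1000
MAX_POSITIONS = 16  # toy sequence-length limit
EMB = 2             # toy width of each lookup (128 in the real model)
OUT = 3             # toy projected width (192 in the real model)


def table(rows, cols):
    """Random stand-in for a learned embedding table or weight matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


x_table = table(MAX_COORD, EMB)  # shared by x0 and x1
y_table = table(MAX_COORD, EMB)  # shared by y0 and y1
h_table = table(MAX_COORD, EMB)
w_table = table(MAX_COORD, EMB)
projection = table(6 * EMB, OUT)          # concatenated width -> layout width
position_table = table(MAX_POSITIONS, OUT)  # sequential position embeddings


def embed_box(box, position):
    """Embed one token's bounding box at a given sequence position."""
    x0, y0, x1, y1 = box
    parts = [
        x_table[x0], y_table[y0], x_table[x1], y_table[y1],
        h_table[y1 - y0], w_table[x1 - x0],
    ]
    concat = [v for part in parts for v in part]  # 6 lookups, concatenated
    projected = [sum(c * w for c, w in zip(concat, col)) for col in zip(*projection)]
    return [p + s for p, s in zip(projected, position_table[position])]


layout_vec = embed_box((10, 20, 110, 50), position=0)
```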

Parity

Verified against HuggingFace transformers:

  • 25/25 encoder stages: minimum cosine similarity cos_min = 1.000000
  • maximum absolute difference max_abs < 1.6e-03 across all layers
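
A parity check of this kind can be sketched as below. It assumes cos_min is the smallest cosine similarity over the compared stages and max_abs is the largest element-wise absolute difference; the stage names and values are made-up placeholders, not the actual verification data, and `parity_report` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def parity_report(reference, candidate):
    """Return (cos_min, max_abs) over all stages present in `reference`."""
    cos_min = min(cosine(reference[name], candidate[name]) for name in reference)
    max_abs = max(max_abs_diff(reference[name], candidate[name]) for name in reference)
    return cos_min, max_abs


# Placeholder activations: stage name -> flattened output.
reference = {"layer_0": [0.5, -1.0, 2.0], "layer_1": [1.5, 0.25, -0.75]}
candidate = {"layer_0": [0.501, -1.0, 1.999], "layer_1": [1.5, 0.25, -0.75]}

cos_min, max_abs = parity_report(reference, candidate)
print(f"cos_min = {cos_min:.6f}, max_abs = {max_abs:.1e}")
```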

Citation
