Kog Laneformer 2B

Kog Laneformer 2B is a latency-oriented Transformer variant built using Delayed Tensor Parallelism (DTP) to overlap tensor-parallel communication with useful computation and weight streaming.

The model is intended to make our Laneformer architecture available on Hugging Face for research, inspection, and fine-tuning. It is also the architecture family used in our public inference-engine preview, where single-request decoding speeds are 3,000 output tokens/s per request on 8x AMD MI300X and 2,100 output tokens/s per request on 8x NVIDIA H200 in FP16 without speculative decoding.

Those figures are Kog Inference Engine (KIE) benchmark results, not expected performance from the generic Transformers runtime.

For background, see the Kog technical posts listed under Citation.

Custom code notice: this model architecture is not currently part of upstream Transformers. Loading with AutoConfig, AutoModel, or AutoModelForCausalLM requires trust_remote_code=True. For production or security-sensitive use, review the modeling code and pin a specific Hub commit hash with revision="<commit-hash>".
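As a minimal sketch of the pinning advice above, the helper below builds the keyword arguments for a pinned load. The name pinned_load_kwargs is hypothetical and not part of this repository, and the 40-character check assumes a full Git commit SHA.

```python
import re


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments that pin remote code to one commit."""
    # A full Git commit hash is 40 hexadecimal characters. Branch names such as
    # "main" can move, so they are rejected here.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError(f"expected a 40-character commit hash, got {revision!r}")
    return {"trust_remote_code": True, "revision": revision}


# Placeholder hash for illustration only; replace it with a commit you have reviewed.
example_kwargs = pinned_load_kwargs("0" * 40)
print(example_kwargs)
```

The resulting dictionary can be unpacked into AutoConfig, AutoTokenizer, or AutoModelForCausalLM from_pretrained calls, for example from_pretrained(model_id, **example_kwargs).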

Why this model exists

Many LLM serving systems optimize aggregate throughput across many concurrent requests. Our Laneformer work targets a different regime: low-batch, single-request decode speed, which is important for agentic coding loops, real-time copilots, voice assistants, and other sequential workflows where each generated step gates the next one.

At batch size 1, autoregressive decoding is often constrained by memory bandwidth, synchronization, kernel-launch overhead, and communication latency rather than raw FLOPs. Laneformer uses a lane-structured architecture and a delayed communication pattern so that tensor-parallel work can be organized around the latency structure of full-node GPU inference.

Architecture overview

Laneformer follows a Llama-style decoder-only architecture, but restructures tensor-parallel communication around lanes.

In standard tensor parallelism, each attention or MLP block typically requires communication before downstream computation can proceed. In Delayed Tensor Parallelism (DTP), local outputs are communicated asynchronously and consumed several modules later, allowing communication latency to be hidden behind subsequent computation and weight streaming.
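The dependency pattern described above can be illustrated with a toy schedule. This is only a sketch and not Kog's implementation: consumer_schedule and overlap_slots are hypothetical names, and counting the delay in whole modules (with a delay of 1 standing in for standard tensor parallelism) is an assumption.

```python
def consumer_schedule(num_modules: int, delay: int) -> list:
    """Return (module, source) pairs: module i consumes the output communicated by module i - delay."""
    if delay < 1:
        raise ValueError("delay must be at least 1")
    schedule = []
    for module in range(num_modules):
        source = module - delay
        # The first `delay` modules have no earlier communicated output to consume.
        schedule.append((module, source if source >= 0 else None))
    return schedule


def overlap_slots(delay: int) -> int:
    """Number of modules that can run while one communication is still in flight."""
    return delay - 1


for delay in (1, 2):
    print(f"delay={delay}: {consumer_schedule(5, delay)}, overlap slots={overlap_slots(delay)}")
```

With a delay of 1, every module waits on the output of the module immediately before it, so nothing hides the communication. With a delay of 2, one full module of computation runs between producing an output and consuming it.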

| Field | Value |
| --- | --- |
| Model family | Laneformer |
| HF model type | laneformer |
| Architecture class | LaneformerForCausalLM |
| Task | Decoder-only causal language modeling |
| Parameters | ~2.3B |
| Hidden size | 3072 |
| Intermediate size | 12288 |
| MLP type | SwiGLU |
| Decoder layers | 15 |
| Attention heads | 32 |
| KV heads | 16 |
| Context length | 4096 tokens |
| Sliding window | 2048 tokens |
| Sliding-window layers | 0-9 |
| Full-attention layers | 10-14 |
| Vocabulary size | 32000 |
| Tokenizer | Llama 2 tokenizer |
| Number of lanes | 8 |
| DTP / Broadcast delay | 2 |
| LM head type | vocab_parallel |
| RoPE theta | 10000 |
| Tied word embeddings | false |
| Expected Transformers version | >=4.57.1 |
| Published weight dtype | BFloat16 |
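As a rough sanity check on the table above, the sketch below estimates the parameter count and lists the per-layer attention type. It assumes standard Llama-style projection shapes without biases, a head dimension of hidden size divided by attention heads, and it ignores normalization weights; none of these are confirmed by this model card.

```python
HIDDEN = 3072
INTERMEDIATE = 12288
LAYERS = 15
HEADS = 32
KV_HEADS = 16
VOCAB = 32000
SLIDING_LAYERS = range(0, 10)  # layers 0-9 use the 2048-token sliding window

# Assumption: head_dim = hidden_size / attention_heads, as in Llama-style models.
head_dim = HIDDEN // HEADS
kv_dim = KV_HEADS * head_dim

# Assumption: bias-free q/k/v/o projections and a three-matrix SwiGLU MLP.
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q and o, plus k and v
mlp_params = 3 * HIDDEN * INTERMEDIATE                   # gate, up, down
embed_params = 2 * VOCAB * HIDDEN                        # untied input and output embeddings
total_params = LAYERS * (attn_params + mlp_params) + embed_params

layer_types = ["sliding" if i in SLIDING_LAYERS else "full" for i in range(LAYERS)]

print(f"head_dim={head_dim}, query heads per KV head={HEADS // KV_HEADS}")
print(f"estimated parameters: {total_params / 1e9:.2f}B")
print(layer_types)
```

Under these assumptions the estimate comes to about 2.32B, which is consistent with the ~2.3B figure in the table.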

The Hugging Face implementation is a reference/compatibility implementation. It is useful for loading, generation, inspection, and downstream experimentation, but it is not the same execution path as our production inference engine, which uses a latency-optimized runtime and low-level GPU kernel optimizations.

Installation

NVIDIA

uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu130
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf

AMD

uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf

Usage

This repository includes chat_template.jinja; use tokenizer.apply_chat_template for chat and instruction prompts.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_id = "kogai/laneformer-2b-it"
revision = "main"  # Remote custom code: pin a reviewed commit hash for reproducible, safer usage.
seed = 42
temperature = 0.8

print(f"Using random seed {seed}.") 
set_seed(seed)

print(f"Loading tokenizer for {model_id} ({revision=})...")
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
)

print("Loading model. This may take a few minutes on first run while files download...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
    dtype="bfloat16",
    device_map="auto",
)
model.eval()
print(f"Model loaded on {model.device}.")

# The Llama 2 tokenizer has no native pad token. If this repo sets PAD to EOS,
# always pass attention_mask for batched generation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Tokenizer had no pad token; using EOS as the pad token.")

messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]

print("\nPrompt:")
print(messages[0]["content"])
print("\nApplying chat template...")
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
print(f"Prompt token count: {inputs.input_ids.shape[-1]}")

print(f"Generating response with sampling enabled ({temperature=})...")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=temperature,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
generated = output_ids[:, inputs.input_ids.shape[-1]:]

print("\nGenerated response:")
print("-" * 80)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Tokenizer & Chat template

This repository uses the Llama 2 tokenizer with a 32,000-token vocabulary. The expected special-token convention is:

| Token | ID |
| --- | --- |
| <unk> | 0 |
| <s> | 1 |
| </s> | 2 |

Instruction-tuned models in this family, such as kogai/laneformer-2b-it, include a repository-level chat_template.jinja.

For the current instruction-tuned template:

  • Only user and assistant roles are supported.
  • System messages are ignored.
  • Each user or assistant message is formatted with a Llama-3-style role header, followed by trimmed message content and </s>.
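The role handling listed above can be mirrored in a small pre-processing step before calling tokenizer.apply_chat_template. This is a sketch only: normalize_messages is a hypothetical helper, and raising an error for unsupported roles is an assumption, since the template's actual behavior for them is not described here. The role header itself is left to the template.

```python
def normalize_messages(messages: list) -> list:
    """Drop system messages, reject unknown roles, and trim message content."""
    normalized = []
    for message in messages:
        role = message["role"]
        if role == "system":
            continue  # the current template ignores system messages
        if role not in ("user", "assistant"):
            # Assumption: unsupported roles are surfaced as an error in this sketch.
            raise ValueError(f"unsupported role: {role!r}")
        normalized.append({"role": role, "content": message["content"].strip()})
    return normalized


example = normalize_messages([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "  What is a binary heap?  "},
])
print(example)
```

Running this kind of check early makes it obvious when a system prompt is being silently dropped.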

Usage:

messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

Training data and procedure

Laneformer 2B was trained with a staged language-modeling recipe:

  1. Broad pre-training to build general language-modeling capabilities.
  2. Code- and reasoning-heavy continuation training.
  3. Lightweight post-training to produce the released instruction-tuned model.

The TorchTitan training stages used sequence length 4,096 and global batch size 1,536, or approximately 6.29M tokens per optimizer step.

Training stages

| Stage | Focus | Framework | Tokens | Steps | Approx. tokens / step | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-training | Broad language-model pre-training | TorchTitan | ~4T | 620,000 | ~6.29M | Nemotron-derived broad mixture; final checkpoint selected. |
| Mid-training | Code- and reasoning-focused continuation | TorchTitan | ~2T | 310,000 | ~6.29M | Continued from the pre-training checkpoint with a code/reasoning-heavy mixture. |
| Post-training | SFT, instruction tuning, identity tuning | Hugging Face Transformers | ~210M | 200 | ~1.05M | Lightweight custom data mixture initialized from the mid-training checkpoint. |
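The token figures above follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative only.

```python
SEQ_LEN = 4096
GLOBAL_BATCH = 1536

# Tokens consumed per optimizer step in the TorchTitan stages.
tokens_per_step = SEQ_LEN * GLOBAL_BATCH

stage_steps = {"pre-training": 620_000, "mid-training": 310_000}
stage_tokens = {name: steps * tokens_per_step for name, steps in stage_steps.items()}

# Post-training uses a smaller step size (~1.05M tokens per step, 200 steps).
post_tokens = 200 * 1.05e6

print(f"tokens per step: {tokens_per_step:,}")
for name, tokens in stage_tokens.items():
    print(f"{name}: {tokens / 1e12:.2f}T tokens")
print(f"post-training: {post_tokens / 1e6:.0f}M tokens")
```

The computed totals are about 3.90T and 1.95T tokens, which round to the ~4T and ~2T values reported for the two TorchTitan stages.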

Data sources

Both TorchTitan stages depend primarily on NVIDIA Nemotron pre-training datasets. Pre-training uses a Nemotron-CC-v2-centered mixture for broad language-model pretraining, with web, synthetic web, QA, math, SFT-style, and code components. Mid-training continues from the pre-training checkpoint on a more code- and reasoning-heavy Nemotron-derived mixture, increasing the emphasis on code metadata, synthetic code data, Common Crawl code pages, math, STEM, and reasoning data.

| Category | Pre-training / Phase 1 | Mid-training / Phase 2 | Notes |
| --- | --- | --- | --- |
| General / web crawl | 63.2% | 10.0% | Phase 2 keeps a smaller high-quality and synthetic web component. |
| Code | 21.0% | 59.01% | Large increase from code metadata, synthetic code data, and Common Crawl code pages. |
| Math / STEM / reasoning | 6.3% | 30.01% | Large increase from math, RQA, STEM, and textbook-like data. |
| Multilingual | 5.0% | 0.0% | Removed in the current Phase 2 mixture. |
| QA / instruction / academic | 4.5% | 1.0% | Only a small general SFT-style component remains. |
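To make the shift between phases concrete, the sketch below sums the published shares and converts the code share into approximate token counts. It uses the rounded ~4T and ~2T stage sizes, so the outputs are estimates rather than reported figures.

```python
# (Phase 1 share %, Phase 2 share %) per category, copied from the mixture table.
MIX = {
    "general_web": (63.2, 10.0),
    "code": (21.0, 59.01),
    "math_stem_reasoning": (6.3, 30.01),
    "multilingual": (5.0, 0.0),
    "qa_instruction_academic": (4.5, 1.0),
}

phase1_total = sum(p1 for p1, _ in MIX.values())
phase2_total = sum(p2 for _, p2 in MIX.values())

# Assumption: ~4T and ~2T are used as round stage sizes.
STAGE_TOKENS = (4e12, 2e12)
code_tokens = tuple(share / 100 * total for share, total in zip(MIX["code"], STAGE_TOKENS))

print(f"phase 1 shares sum to {phase1_total:.2f}%")
print(f"phase 2 shares sum to {phase2_total:.2f}%")
print(f"approx. code tokens: {code_tokens[0] / 1e12:.2f}T then {code_tokens[1] / 1e12:.2f}T")
```

The Phase 2 shares sum to 100.02%, which presumably reflects rounding in the published values. On these estimates the code portion grows from roughly 0.84T to 1.18T tokens even though the mid-training stage is half the size.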

Training infrastructure

| Field | Value |
| --- | --- |
| Pre-training framework | TorchTitan |
| Post-training framework | Hugging Face Transformers |
| Training hardware | 24 nodes × 8 NVIDIA H100 GPUs = 192 H100 GPUs |
| Training clusters | Scaleway cluster; ADASTRA cluster |
| Cloud/provider/region | France |
| Training time | ~21 days |
| Precision | FP32/BF16 mixed precision |
| Parallelism | FSDP |
| Optimizer | AdamW |
| Learning-rate schedule | WSD |
| Sequence length during training | 4096 |
| Checkpoint selection | final step |

Inference and performance

Public-preview numbers for the Kog Inference Engine running Laneformer 2B:

| Setting | Reported speed | Notes |
| --- | --- | --- |
| 8x AMD MI300X | 3,000 output tokens/s/request | FP16, batch size 1, no speculative decoding |
| 8x NVIDIA H200 | 2,100 output tokens/s/request | FP16, batch size 1, no speculative decoding |

These benchmark numbers are for Kog's optimized inference stack, not the plain Hugging Face Transformers implementation in this repository. This preview does not rely on quantization, speculative decoding, pruning, early exit, or KV-cache compression to reach this speed.
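For intuition, the reported speeds can be converted into per-token latency and into the weight traffic they would imply. This is a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes every FP16 weight of a ~2.3B-parameter model is read once per generated token, which ignores KV-cache reads and any caching effects.

```python
REPORTED_SPEEDS = {"8x AMD MI300X": 3000, "8x NVIDIA H200": 2100}  # output tokens/s/request
PARAMS = 2.3e9       # ~2.3B parameters
BYTES_PER_PARAM = 2  # FP16


def per_token_latency_ms(tokens_per_s: float) -> float:
    """Average time per generated token in milliseconds."""
    return 1000 / tokens_per_s


def weight_stream_tb_per_s(tokens_per_s: float) -> float:
    """Aggregate weight bytes read per second, in TB/s, if all weights are read once per token."""
    return PARAMS * BYTES_PER_PARAM * tokens_per_s / 1e12


for setting, speed in REPORTED_SPEEDS.items():
    print(
        f"{setting}: {per_token_latency_ms(speed):.3f} ms/token, "
        f"{1024 / speed:.2f} s for 1024 tokens, "
        f"~{weight_stream_tb_per_s(speed):.1f} TB/s of weight reads across the node"
    )
```

At 3,000 tokens/s each token has a budget of about a third of a millisecond, which is why synchronization and communication latency matter so much in this regime.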

Evaluation

The following internal Kog evaluations were run in June 2026 against a local Kog serving endpoint for the Laneformer 2B instruction-tuned checkpoint at batch size 1 in FP16. These values evaluate the served preview model; small numerical differences may appear when evaluating the BF16 Hugging Face checkpoint directly through a different runtime.

Code generation

HumanEval+ and MBPP+ were evaluated with greedy decoding. Generation was performed through the Kog serving endpoint; scoring used EvalPlus.

A custom code-block selection step named target_function was applied before scoring. When possible, this step selects the code block containing the target function name before EvalPlus preprocessing.
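A selection step of this kind might look like the sketch below. It is not Kog's implementation: select_target_block is a hypothetical name, the regular expressions are illustrative, and returning the response unchanged when no block matches is an assumption about the fallback.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block


def select_target_block(response: str, function_name: str) -> str:
    """Return the first fenced code block that defines function_name, else the response unchanged."""
    block_pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    definition = re.compile(r"\bdef\s+" + re.escape(function_name) + r"\s*\(")
    for block in block_pattern.findall(response):
        if definition.search(block):
            return block
    # Assumption: with no matching block, later preprocessing sees the raw response.
    return response


response = (
    f"A helper first:\n{FENCE}python\ndef helper(x):\n    return x\n{FENCE}\n"
    f"Then the solution:\n{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}\n"
)
print(select_target_block(response, "add"))
```

Selecting by function name avoids scoring a helper or example block when the model emits several code blocks in one answer.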

| Benchmark | Metric | Value | Samples | Decoding | Scoring | Postprocessing |
| --- | --- | --- | --- | --- | --- | --- |
| HumanEval+ | pass@1 | 45.1 | 164 | Greedy, temperature=0, do_sample=False | EvalPlus | target_function block selection |
| MBPP+ | pass@1 | 51.6 | 378 | Greedy, temperature=0, do_sample=False | EvalPlus | target_function block selection |
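With greedy decoding there is one completion per problem, so pass@1 is the fraction of problems solved. The sketch below recovers the implied solved counts from the rounded percentages; the counts are inferred, not reported.

```python
def implied_solved(pass_at_1_percent: float, samples: int) -> int:
    """Nearest whole number of solved problems consistent with a rounded pass@1 value."""
    return round(pass_at_1_percent / 100 * samples)


RESULTS = {"HumanEval+": (45.1, 164), "MBPP+": (51.6, 378)}

for name, (score, samples) in RESULTS.items():
    solved = implied_solved(score, samples)
    print(f"{name}: ~{solved}/{samples} solved ({100 * solved / samples:.1f}%)")
```

Each additional solved problem moves HumanEval+ by about 0.6 points, which is worth keeping in mind when comparing small score differences.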

General multiple-choice checks

ARC-Challenge and ARC-Easy were evaluated with 0-shot multiple-choice logprobs. Orchestration used lm-evaluation-harness 0.4.12 against the Kog serving endpoint.

| Benchmark | Metric | Value | Samples | Setting |
| --- | --- | --- | --- | --- |
| ARC-Challenge | Normalized accuracy | 31.06 | 1,172 | 0-shot multiple-choice logprobs, lm_eval |
| ARC-Easy | Normalized accuracy | 47.90 | 2,376 | 0-shot multiple-choice logprobs, lm_eval |
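To gauge how much sampling noise these accuracies carry, a simple binomial standard error can be computed from the score and sample count. This sketch treats each question as an independent trial, which is an approximation; the values it prints are not figures reported by the evaluation run.

```python
import math


def binomial_stderr(accuracy_percent: float, samples: int) -> float:
    """Standard error of an accuracy, in percentage points, under a binomial model."""
    p = accuracy_percent / 100
    return 100 * math.sqrt(p * (1 - p) / samples)


ARC_RESULTS = {"ARC-Challenge": (31.06, 1172), "ARC-Easy": (47.90, 2376)}

for name, (accuracy, samples) in ARC_RESULTS.items():
    print(f"{name}: {accuracy:.2f} +/- {binomial_stderr(accuracy, samples):.2f}")
```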

Long-context synthetic checks

Long-context checks were also run with RULER-style synthetic tasks at 2,048 and 4,096 tokens, using 0-shot greedy chat generation in the LM Evaluation Harness framework. Values below are string-match scores.

| Task | 2,048 tokens | 4,096 tokens | Effective samples | Setting |
| --- | --- | --- | --- | --- |
| NIAH single 1 | 100.00 | 100.00 | 500 | lm_eval |
| NIAH single 2 | 100.00 | 100.00 | 500 | lm_eval |
| NIAH single 3 | 99.80 | 91.60 | 500 | lm_eval |
| NIAH multikey 1 | 99.20 | 83.40 | 500 | lm_eval |
| NIAH multikey 2 | 97.80 | 60.00 | 500 | lm_eval |
| NIAH multikey 3 | 91.80 | 91.20 | 500 | lm_eval |
| NIAH multiquery | 95.40 | 69.70 | 500 | lm_eval |
| NIAH multivalue | 94.60 | 82.20 | 500 | lm_eval |
| RULER variable tracking | 77.92 | 3.92 | 500 | lm_eval |
| RULER common words extraction | 24.72 | 78.14 | 500 | lm_eval |
| RULER frequent words extraction | 78.80 | 64.73 | 500 | lm_eval |
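The change between the two context lengths is easier to scan as per-task deltas. The sketch below only rearranges the scores listed above; it makes no claim about why individual tasks move.

```python
# (score at 2,048 tokens, score at 4,096 tokens), copied from the results above.
RULER_SCORES = {
    "NIAH single 1": (100.00, 100.00),
    "NIAH single 2": (100.00, 100.00),
    "NIAH single 3": (99.80, 91.60),
    "NIAH multikey 1": (99.20, 83.40),
    "NIAH multikey 2": (97.80, 60.00),
    "NIAH multikey 3": (91.80, 91.20),
    "NIAH multiquery": (95.40, 69.70),
    "NIAH multivalue": (94.60, 82.20),
    "RULER variable tracking": (77.92, 3.92),
    "RULER common words extraction": (24.72, 78.14),
    "RULER frequent words extraction": (78.80, 64.73),
}

# Positive delta means the score is higher at 4,096 tokens than at 2,048.
deltas = {task: round(long - short, 2) for task, (short, long) in RULER_SCORES.items()}
largest_drop = min(deltas, key=deltas.get)
largest_gain = max(deltas, key=deltas.get)

for task, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{task}: {delta:+.2f}")
```

Variable tracking shows the largest drop and common words extraction is the only task that improves at the longer length.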

Intended use

This model is intended for:

  • Research and experimentation with latency-oriented Transformer architectures.
  • Evaluation of Laneformer / Delayed Tensor Parallelism design choices.
  • Causal language modeling and code-generation experiments.
  • Fine-tuning experiments, subject to all applicable license terms.
  • Inference-system and attention-backend testing.

This model is not automatically suitable for high-stakes settings such as medical, legal, financial, employment, education, public-sector, law-enforcement, or safety-critical decision-making.

Out-of-scope use

Do not use this model or tokenizer in ways that violate the repository licenses, the Llama 2 Community License, the Llama 2 Acceptable Use Policy, applicable law, privacy rights, or safety policies.

Do not present generated outputs as factual without independent verification. Do not use the model as the sole decision-maker in high-stakes workflows.

Limitations and risks

  • This is a 2B-class model and is not a frontier general-purpose assistant.
  • The model is primarily mid-trained and post-trained for code generation: it is not optimized as a broad general-purpose assistant.
  • The model may generate incorrect, insecure, biased, toxic, or misleading content.
  • The model may hallucinate facts, APIs, package names, citations, and code behavior.
  • The base model may not follow instructions reliably unless it has been instruction-tuned.
  • The public HF implementation prioritizes compatibility and inspection, not the full Kog Inference Engine latency path.
  • Performance may depend on the exact PyTorch, Transformers, device, dtype, and attention backend used.
  • Because this repo uses custom code, downstream users should review code before execution and pin a revision in production.

License

This repository is multi-license.

Kog-owned materials: Apache License 2.0

Unless a file states otherwise, the model weights, Hugging Face custom modeling and configuration code, model configuration files, metadata files, documentation, and README/model card are released under the Apache License 2.0.

See LICENSE for the full Apache License 2.0 text.

Tokenizer materials: Llama 2 Community License

The tokenizer files are based on the Llama 2 tokenizer and are not licensed under Apache License 2.0. These tokenizer materials include, without limitation:

  • tokenizer.model
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • any other file whose purpose is to reproduce or configure the Llama 2 tokenizer

These tokenizer materials are distributed under the LLAMA 2 Community License Agreement. See THIRD_PARTY_LICENSES/LLAMA2_LICENSE and NOTICE.

Required Llama 2 attribution notice:

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Users and redistributors are responsible for complying with the Llama 2 Community License and its Acceptable Use Policy.

Citation

If you use this model or architecture, please cite the model repository and the relevant Kog technical posts:

@misc{kog_laneformer_2b_it_2026,
  title        = {Kog Laneformer 2B Instruct Model},
  author       = {Kog Team},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kogai/laneformer-2b-it}}
}

@online{kog_laneformer_2b_2026,
  title        = {{Laneformer 2B: The Latency-First Model Behind Kog Inference Engine}},
  author       = {Kog Team},
  year         = {2026},
  url          = {https://huggingface.co/blog/kogai/kog-laneformer-2b-the-latency-first-model},
  note         = {HuggingFace blog}
}

@misc{kog_real_time_llm_inference_2026,
  title        = {Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)},
  author       = {Kog Team},
  year         = {2026},
  howpublished = {\url{https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/}}
}

@misc{kog_delayed_tensor_parallelism_2026,
  title        = {Delayed Tensor Parallelism for Faster Transformer Inference},
  author       = {Kog Team},
  year         = {2026},
  howpublished = {\url{https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference/}}
}

Contact

For questions about this model, open a discussion or issue on the Hugging Face repository, or contact Kog AI through the channels listed on kog.ai.
