Instructions to use kogai/laneformer-2b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use kogai/laneformer-2b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="kogai/laneformer-2b-it", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("kogai/laneformer-2b-it", trust_remote_code=True, dtype="auto")


vLLM

How to use kogai/laneformer-2b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "kogai/laneformer-2b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kogai/laneformer-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/kogai/laneformer-2b-it

SGLang

How to use kogai/laneformer-2b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kogai/laneformer-2b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kogai/laneformer-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kogai/laneformer-2b-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kogai/laneformer-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use kogai/laneformer-2b-it with Docker Model Runner:
```
docker model run hf.co/kogai/laneformer-2b-it
```

Kog Laneformer 2B

Kog Laneformer 2B is a latency-oriented Transformer variant built using Delayed Tensor Parallelism (DTP) to overlap tensor-parallel communication with useful computation and weight streaming.

The model is intended to make our Laneformer architecture available on Hugging Face for research, inspection, and fine-tuning. It is also the architecture family used in our public inference-engine preview, where single-request decoding reaches 3,000 output tokens/s on 8x AMD MI300X and 2,100 output tokens/s on 8x NVIDIA H200, in FP16 and without speculative decoding.

Those figures are Kog Inference Engine (KIE) benchmark results, not expected performance from the generic Transformers runtime.

For background, see the Kog technical posts listed under Citation.

Custom code notice: this model architecture is not currently part of upstream Transformers. Loading with AutoConfig, AutoModel, or AutoModelForCausalLM requires trust_remote_code=True. For production or security-sensitive use, review the modeling code and pin a specific Hub commit hash with revision="<commit-hash>".
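As a minimal sketch of the pinning advice above, the helper below refuses to load remote code unless `revision` looks like a full commit hash. The names `is_pinned_revision` and `load_pinned` are hypothetical, and the 40-hex-character check assumes the Hub exposes Git-style commit SHAs; only `from_pretrained`, `trust_remote_code`, and `revision` come from the notice itself.

```python
import re

# Assumption: a pinned revision is a full 40-character lowercase hex commit SHA.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True only for a full commit hash, not a branch or tag name."""
    return bool(_COMMIT_SHA.fullmatch(revision))


def load_pinned(model_id: str, revision: str):
    """Load the model with remote code only when the revision is pinned."""
    if not is_pinned_revision(revision):
        raise ValueError(
            f"Refusing to run remote code from unpinned revision {revision!r}; "
            "pass a reviewed commit hash instead."
        )
    # Imported lazily so the check above can be used without Transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        revision=revision,
    )
```

A branch name such as `main` can move to new, unreviewed code, which is why this sketch rejects it.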

Why this model exists

Many LLM serving systems optimize aggregate throughput across many concurrent requests. Our Laneformer work targets a different regime: low-batch, single-request decode speed, which is important for agentic coding loops, real-time copilots, voice assistants, and other sequential workflows where each generated step gates the next one.

At batch size 1, autoregressive decoding is often constrained by memory bandwidth, synchronization, kernel launch overheads, and communication latency rather than raw FLOPs. Laneformer uses a lane-structured architecture and delayed communication pattern so tensor-parallel work can be organized around the latency structure of full-node GPU inference.
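To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope bound, not a measurement. It assumes every weight is read once per generated token and ignores KV-cache traffic, synchronization, and kernel launch overheads. The function name is hypothetical, and the 5.3e12 B/s per-GPU bandwidth is an assumed peak figure used only for illustration.

```python
def max_decode_tokens_per_s(
    num_params: float,
    bytes_per_param: int,
    bandwidth_bytes_per_s: float,
    num_gpus: int,
) -> float:
    """Bandwidth-only upper bound on batch-size-1 decode speed."""
    # Each decoded token requires streaming all weights from memory once.
    bytes_per_token = num_params * bytes_per_param
    # With ideal tensor parallelism, the GPUs read their weight shards in parallel.
    return num_gpus * bandwidth_bytes_per_s / bytes_per_token


# ~2.3B parameters in 16-bit precision on 8 GPUs.
# Assumption: 5.3e12 B/s peak memory bandwidth per GPU.
bound = max_decode_tokens_per_s(2.3e9, 2, 5.3e12, 8)
print(f"Bandwidth-only upper bound: {bound:,.0f} tokens/s")
```

Real decode speed sits below such a bound because of the synchronization and communication costs described above.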

Architecture overview

Laneformer follows a Llama-style decoder-only architecture, but restructures tensor-parallel communication around lanes.

In standard tensor parallelism, each attention or MLP block typically requires communication before downstream computation can proceed. In Delayed Tensor Parallelism (DTP), local outputs are communicated asynchronously and consumed several modules later, allowing communication latency to be hidden behind subsequent computation and weight streaming.
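The effect of delaying consumption can be illustrated with a toy latency model. This is a sketch under simplifying assumptions, not Kog's implementation: every module has the same compute and communication time, communication overlaps fully with compute, and `delay` counts how many modules later an output is consumed, so `delay=1` stands in for the blocking case. The function name is hypothetical.

```python
def decode_step_latency(
    num_modules: int, compute: float, comm: float, delay: int
) -> float:
    """Time until the last module finishes its local compute."""
    if delay < 1:
        raise ValueError("delay must be at least 1")
    end_times: list[float] = []
    for j in range(num_modules):
        # A module cannot start before the previous module on this GPU ends.
        prev_end = end_times[j - 1] if j > 0 else 0.0
        # It also needs the communicated output of module j - delay.
        src = j - delay
        arrival = end_times[src] + comm if src >= 0 else 0.0
        start = max(prev_end, arrival)
        end_times.append(start + compute)
    return end_times[-1]


blocking = decode_step_latency(num_modules=30, compute=1.0, comm=0.5, delay=1)
delayed = decode_step_latency(num_modules=30, compute=1.0, comm=0.5, delay=2)
print(f"blocking: {blocking}, delayed: {delayed}")
```

In this model the communication cost disappears from the critical path whenever `comm` is no larger than `(delay - 1) * compute`.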

| Field | Value |
| --- | --- |
| Model family | Laneformer |
| HF model type | `laneformer` |
| Architecture class | `LaneformerForCausalLM` |
| Task | Decoder-only causal language modeling |
| Parameters | ~2.3B |
| Hidden size | 3072 |
| Intermediate size | 12288 |
| MLP type | SwiGLU |
| Decoder layers | 15 |
| Attention heads | 32 |
| KV heads | 16 |
| Context length | 4096 tokens |
| Sliding window | 2048 tokens |
| Sliding-window layers | 0-9 |
| Full-attention layers | 10-14 |
| Vocabulary size | 32000 |
| Tokenizer | Llama 2 tokenizer |
| Number of lanes | 8 |
| DTP / Broadcast delay | 2 |
| LM head type | `vocab_parallel` |
| RoPE theta | 10000 |
| Tied word embeddings | false |
| Expected Transformers version | `>=4.57.1` |
| Published weight dtype | BFloat16 |
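As a sanity check on the ~2.3B figure, the values in the table above can be combined into an approximate parameter count. This sketch assumes a standard Llama-style layout (head dimension equal to hidden size divided by attention heads, bias-free projections, three SwiGLU matrices) and ignores normalization weights and any lane-specific parameters, so it is an estimate rather than the exact count.

```python
hidden_size = 3072
intermediate_size = 12288
num_layers = 15
num_heads = 32
num_kv_heads = 16
vocab_size = 32000

# Assumption: head_dim = hidden_size / num_heads, as in Llama-style models.
head_dim = hidden_size // num_heads

# Attention: query and output projections plus grouped key/value projections.
q_and_o = 2 * hidden_size * (num_heads * head_dim)
k_and_v = 2 * hidden_size * (num_kv_heads * head_dim)
attention_params = q_and_o + k_and_v

# SwiGLU MLP: gate, up, and down projections.
mlp_params = 3 * hidden_size * intermediate_size

# Untied embeddings: separate input embedding and LM head matrices.
embedding_params = 2 * vocab_size * hidden_size

total_params = num_layers * (attention_params + mlp_params) + embedding_params
print(f"Approximate parameters: {total_params / 1e9:.2f}B")
```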

The Hugging Face implementation is a reference/compatibility implementation. It is useful for loading, generation, inspection, and downstream experimentation, but it is not the same execution path as our production inference engine, which uses a latency-optimized runtime and low-level GPU kernel optimizations.

Installation

NVIDIA

uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu130
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf

AMD

uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf

Usage

This repository includes chat_template.jinja; use tokenizer.apply_chat_template for chat and instruction prompts.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_id = "kogai/laneformer-2b-it"
revision = "main"  # Remote custom code: pin a reviewed commit hash for reproducible, safer usage.
seed = 42
temperature = 0.8

print(f"Using random seed {seed}.")
set_seed(seed)

print(f"Loading tokenizer for {model_id} ({revision=})...")
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
)

print("Loading model. This may take a few minutes on first run while files download...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
    dtype="bfloat16",
    device_map="auto",
)
model.eval()
print(f"Model loaded on {model.device}.")

# The Llama 2 tokenizer has no native pad token. If this repo sets PAD to EOS,
# always pass attention_mask for batched generation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Tokenizer had no pad token; using EOS as the pad token.")

messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]

print("\nPrompt:")
print(messages[0]["content"])
print("\nApplying chat template...")
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
print(f"Prompt token count: {inputs.input_ids.shape[-1]}")

print(f"Generating response with sampling enabled ({temperature=})...")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=temperature,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
generated = output_ids[:, inputs.input_ids.shape[-1]:]

print("\nGenerated response:")
print("-" * 80)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Tokenizer & Chat template

This repository uses the Llama 2 tokenizer with a 32,000-token vocabulary. The expected special-token convention is:

| Token | ID |
| --- | --- |
| `<unk>` | 0 |
| `<s>` | 1 |
| `</s>` | 2 |

Instruction-tuned models in this family, such as kogai/laneformer-2b-it, include a repository-level chat_template.jinja.

For the current instruction-tuned template:

Only user and assistant roles are supported.
System messages are ignored.
Each user or assistant message is formatted with a Llama-3-style role header, followed by trimmed message content and </s>.
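The three rules above can be sketched in plain Python. This is an illustration only, not the contents of chat_template.jinja: the exact header string is not given here, so `HEADER` below is a hypothetical Llama-3-style placeholder, and raising on unknown roles is an assumed behavior. Use `tokenizer.apply_chat_template` for real prompts.

```python
# Hypothetical header format; the real template may use different strings.
HEADER = "<|start_header_id|>{role}<|end_header_id|>\n\n"
EOS = "</s>"


def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages following the rules described for the template."""
    parts = []
    for message in messages:
        role = message["role"]
        if role == "system":
            # System messages are ignored.
            continue
        if role not in ("user", "assistant"):
            # Assumption: unsupported roles are rejected rather than dropped.
            raise ValueError(f"Unsupported role: {role!r}")
        # Role header, then trimmed content, then the end-of-sequence token.
        parts.append(HEADER.format(role=role) + message["content"].strip() + EOS)
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(HEADER.format(role="assistant"))
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "SYSTEM_TEXT"},
        {"role": "user", "content": "  Who are you?  "},
    ]
)
print(prompt)
```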

Usage:

messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

Training data and procedure

Laneformer 2B was trained with a staged language-modeling recipe:

Broad pre-training to build general language-modeling capabilities.
Code- and reasoning-heavy continuation training.
Lightweight post-training to produce the released instruction-tuned model.

The TorchTitan training stages used sequence length 4,096 and global batch size 1,536, or approximately 6.29M tokens per optimizer step.

Training stages

| Stage | Focus | Framework | Tokens | Steps | Approx. tokens / step | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-training | Broad language-model pre-training | TorchTitan | ~4T | 620,000 | ~6.29M | Nemotron-derived broad mixture; final checkpoint selected. |
| Mid-training | Code- and reasoning-focused continuation | TorchTitan | ~2T | 310,000 | ~6.29M | Continued from the pre-training checkpoint with a code/reasoning-heavy mixture. |
| Post-training | SFT, instruction tuning, identity tuning | Hugging Face Transformers | ~210M | 200 | ~1.05M | Lightweight custom data mixture initialized from the mid-training checkpoint. |
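The token budgets in the table above follow from the stated sequence length, global batch size, and step counts. The short calculation below reproduces them; the post-training tokens per step are derived from the rounded ~210M total, so that value is approximate.

```python
sequence_length = 4096
global_batch_size = 1536

# Tokens consumed by one optimizer step in the TorchTitan stages.
tokens_per_step = sequence_length * global_batch_size

pretraining_tokens = 620_000 * tokens_per_step
midtraining_tokens = 310_000 * tokens_per_step

# Post-training: ~210M tokens over 200 steps.
posttraining_tokens_per_step = 210e6 / 200

print(f"Tokens per step: {tokens_per_step / 1e6:.2f}M")
print(f"Pre-training:    {pretraining_tokens / 1e12:.2f}T")
print(f"Mid-training:    {midtraining_tokens / 1e12:.2f}T")
print(f"Post-training:   {posttraining_tokens_per_step / 1e6:.2f}M per step")
```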

Data sources

Both TorchTitan stages depend primarily on NVIDIA Nemotron pre-training datasets. Pre-training uses a Nemotron-CC-v2-centered mixture for broad language-model pretraining, with web, synthetic web, QA, math, SFT-style, and code components. Mid-training continues from the pre-training checkpoint on a more code- and reasoning-heavy Nemotron-derived mixture, increasing the emphasis on code metadata, synthetic code data, Common Crawl code pages, math, STEM, and reasoning data.

| Category | Pre-training / Phase 1 | Mid-training / Phase 2 | Notes |
| --- | --- | --- | --- |
| General / web crawl | 63.2% | 10.0% | Phase 2 keeps a smaller high-quality and synthetic web component. |
| Code | 21.0% | 59.01% | Large increase from code metadata, synthetic code data, and Common Crawl code pages. |
| Math / STEM / reasoning | 6.3% | 30.01% | Large increase from math, RQA, STEM, and textbook-like data. |
| Multilingual | 5.0% | 0.0% | Removed in the current Phase 2 mixture. |
| QA / instruction / academic | 4.5% | 1.0% | Only a small general SFT-style component remains. |

Training infrastructure

| Field | Value |
| --- | --- |
| Pre-training framework | TorchTitan |
| Post-training framework | Hugging Face Transformers |
| Training hardware | 24 nodes × 8 NVIDIA H100 GPUs = 192 H100 GPUs |
| Training clusters | Scaleway cluster; ADASTRA cluster |
| Cloud/provider/region | France |
| Training time | ~21 days |
| Precision | FP32/BF16 mixed precision |
| Parallelism | FSDP |
| Optimizer | AdamW |
| Learning-rate schedule | WSD |
| Sequence length during training | 4096 |
| Checkpoint selection | final step |

Inference and performance

Public-preview numbers for the Kog Inference Engine running Laneformer 2B:

| Setting | Reported speed | Notes |
| --- | --- | --- |
| 8x AMD MI300X | 3,000 output tokens/s/request | FP16, batch size 1, no speculative decoding |
| 8x NVIDIA H200 | 2,100 output tokens/s/request | FP16, batch size 1, no speculative decoding |

These benchmark numbers are for Kog's optimized inference stack, not the plain Hugging Face Transformers implementation in this repository. This preview does not rely on quantization, speculative decoding, pruning, early exit, or KV-cache compression to reach this speed.
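For sequential workloads it can help to read the reported speeds as per-token latency. The conversion below is plain arithmetic on the figures above; the helper names are hypothetical, and the 1,024-token response length is an arbitrary example.

```python
def ms_per_token(tokens_per_s: float) -> float:
    """Average time between consecutive output tokens, in milliseconds."""
    return 1000.0 / tokens_per_s


def seconds_for_response(num_tokens: int, tokens_per_s: float) -> float:
    """Decode time for a response of num_tokens, ignoring prompt processing."""
    return num_tokens / tokens_per_s


for label, speed in [("8x AMD MI300X", 3000.0), ("8x NVIDIA H200", 2100.0)]:
    print(
        f"{label}: {ms_per_token(speed):.3f} ms/token, "
        f"{seconds_for_response(1024, speed):.3f} s for 1,024 tokens"
    )
```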

Evaluation

The following internal Kog evaluations were run in June 2026 against a local Kog serving endpoint for the Laneformer 2B instruction-tuned checkpoint at batch size 1 in FP16. These values evaluate the served preview model; small numerical differences may appear when evaluating the BF16 Hugging Face checkpoint directly through a different runtime.

Code generation

HumanEval+ and MBPP+ were evaluated with greedy decoding. Generation was performed through the Kog serving endpoint, and scoring used EvalPlus.

A custom code-block selection step named target_function was applied before scoring. When possible, this step selects the code block containing the target function name before EvalPlus preprocessing.
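A block-selection step of this kind might look like the sketch below. It is an assumption about the general approach, not Kog's `target_function` code: the function name `select_target_block`, the regular expressions, and the fallback to the first block (or the raw completion) are all hypothetical.

```python
import re

# Built programmatically so this example does not contain a literal fence.
FENCE = "`" * 3
_BLOCK_RE = re.compile(FENCE + r"[A-Za-z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)


def select_target_block(completion: str, function_name: str) -> str:
    """Pick the fenced code block that defines function_name, when possible."""
    blocks = _BLOCK_RE.findall(completion)
    definition = re.compile(
        r"^\s*def\s+" + re.escape(function_name) + r"\s*\(", re.MULTILINE
    )
    for block in blocks:
        if definition.search(block):
            return block
    # Assumed fallback: first block if any, otherwise the raw completion.
    return blocks[0] if blocks else completion


completion = (
    "Here is a helper:\n"
    + FENCE + "python\ndef helper():\n    pass\n" + FENCE
    + "\nAnd the solution:\n"
    + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE + "\n"
)
print(select_target_block(completion, "add"))
```

Selecting by function name guards against scoring an example or helper snippet when the model emits several code blocks.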

| Benchmark | Metric | Value | Samples | Decoding | Scoring | Postprocessing |
| --- | --- | --- | --- | --- | --- | --- |
| HumanEval+ | pass@1 | 45.1 | 164 | Greedy, temperature=0, `do_sample=False` | EvalPlus | `target_function` block selection |
| MBPP+ | pass@1 | 51.6 | 378 | Greedy, temperature=0, `do_sample=False` | EvalPlus | `target_function` block selection |

General multiple-choice checks

ARC-Challenge and ARC-Easy were evaluated with 0-shot multiple-choice logprobs. Orchestration used lm-evaluation-harness 0.4.12 against the Kog serving endpoint.

| Benchmark | Metric | Value | Samples | Setting |
| --- | --- | --- | --- | --- |
| ARC-Challenge | Normalized accuracy | 31.06 | 1,172 | 0-shot multiple-choice logprobs, `lm_eval` |
| ARC-Easy | Normalized accuracy | 47.90 | 2,376 | 0-shot multiple-choice logprobs, `lm_eval` |

Long-context synthetic checks

Long-context checks were also run with RULER-style synthetic tasks at 2,048 and 4,096 tokens, using 0-shot greedy chat generation through the LM Evaluation Harness framework. Values below are string-match scores.

| Task | 2,048 tokens | 4,096 tokens | Effective samples | Setting |
| --- | --- | --- | --- | --- |
| NIAH single 1 | 100.00 | 100.00 | 500 | using `lm_eval` |
| NIAH single 2 | 100.00 | 100.00 | 500 | using `lm_eval` |
| NIAH single 3 | 99.80 | 91.60 | 500 | using `lm_eval` |
| NIAH multikey 1 | 99.20 | 83.40 | 500 | using `lm_eval` |
| NIAH multikey 2 | 97.80 | 60.00 | 500 | using `lm_eval` |
| NIAH multikey 3 | 91.80 | 91.20 | 500 | using `lm_eval` |
| NIAH multiquery | 95.40 | 69.70 | 500 | using `lm_eval` |
| NIAH multivalue | 94.60 | 82.20 | 500 | using `lm_eval` |
| RULER variable tracking | 77.92 | 3.92 | 500 | using `lm_eval` |
| RULER common words extraction | 24.72 | 78.14 | 500 | using `lm_eval` |
| RULER frequent words extraction | 78.80 | 64.73 | 500 | using `lm_eval` |
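For readers unfamiliar with string-match scoring, the sketch below shows one common definition: each sample scores the fraction of its reference strings found in the generated text, and the task score is the mean over samples scaled to 0-100. This is an assumption about the metric's general shape, not the exact harness code; the function name and case-insensitive matching are illustrative choices.

```python
def string_match_score(predictions: list[str], references: list[list[str]]) -> float:
    """Mean fraction of reference strings found in each prediction, times 100."""
    per_sample = []
    for prediction, expected in zip(predictions, references):
        text = prediction.lower()
        # Count how many expected strings appear anywhere in the output.
        hits = sum(1 for item in expected if item.lower() in text)
        per_sample.append(hits / len(expected))
    return 100.0 * sum(per_sample) / len(per_sample)


predictions = ["The magic number is 12345.", "I could not find it."]
references = [["12345"], ["67890"]]
print(string_match_score(predictions, references))
```

Under this definition, a single-answer task with 499 of 500 samples matched scores 99.80, while tasks with several references per sample can also take intermediate per-sample values.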

Intended use

This model is intended for:

Research and experimentation with latency-oriented Transformer architectures.
Evaluation of Laneformer / Delayed Tensor Parallelism design choices.
Causal language modeling and code-generation experiments.
Fine-tuning experiments, subject to all applicable license terms.
Inference-system and attention-backend testing.

This model is not automatically suitable for high-stakes settings such as medical, legal, financial, employment, education, public-sector, law-enforcement, or safety-critical decision-making.

Out-of-scope use

Do not use this model or tokenizer in ways that violate the repository licenses, the Llama 2 Community License, the Llama 2 Acceptable Use Policy, applicable law, privacy rights, or safety policies.

Do not present generated outputs as factual without independent verification. Do not use the model as the sole decision-maker in high-stakes workflows.

Limitations and risks

This is a 2B-class model and is not a frontier general-purpose assistant.
The model is primarily mid-trained and post-trained for code generation: it is not optimized as a broad general-purpose assistant.
The model may generate incorrect, insecure, biased, toxic, or misleading content.
The model may hallucinate facts, APIs, package names, citations, and code behavior.
The base model may not follow instructions reliably unless it has been instruction-tuned.
The public HF implementation prioritizes compatibility and inspection, not the full Kog Inference Engine latency path.
Performance may depend on the exact PyTorch, Transformers, device, dtype, and attention backend used.
Because this repo uses custom code, downstream users should review code before execution and pin a revision in production.

License

This repository is multi-license.

Kog-owned materials: Apache License 2.0

Unless a file states otherwise, the model weights, Hugging Face custom modeling and configuration code, model configuration files, metadata files, documentation, and README/model card are released under the Apache License 2.0.

See LICENSE for the full Apache License 2.0 text.

Tokenizer materials: Llama 2 Community License

The tokenizer files are based on the Llama 2 tokenizer and are not licensed under Apache License 2.0. These tokenizer materials include, without limitation:

tokenizer.model
tokenizer.json
tokenizer_config.json
special_tokens_map.json
any other file whose purpose is to reproduce or configure the Llama 2 tokenizer

These tokenizer materials are distributed under the LLAMA 2 Community License Agreement. See THIRD_PARTY_LICENSES/LLAMA2_LICENSE and NOTICE.

Required Llama 2 attribution notice:

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Users and redistributors are responsible for complying with the Llama 2 Community License and its Acceptable Use Policy.

Citation

If you use this model or architecture, please cite the model repository and the relevant Kog technical posts:

@misc{kog_laneformer_2b_it_2026,
  title        = {Kog Laneformer 2B Instruct Model},
  author       = {Kog Team},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kogai/laneformer-2b-it}}
}

@online{kog_laneformer_2b_2026,
  title        = {{Laneformer 2B: The Latency-First Model Behind Kog Inference Engine}},
  author       = {Kog Team},
  year         = {2026},
  url          = {https://huggingface.co/blog/kogai/kog-laneformer-2b-the-latency-first-model},
  note         = {Hugging Face blog}
}

@misc{kog_real_time_llm_inference_2026,
  title        = {Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)},
  author       = {Kog Team},
  year         = {2026},
  howpublished = {\url{https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/}}
}

@misc{kog_delayed_tensor_parallelism_2026,
  title        = {Delayed Tensor Parallelism for Faster Transformer Inference},
  author       = {Kog Team},
  year         = {2026},
  howpublished = {\url{https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference/}}
}