Kog Laneformer 2B
Kog Laneformer 2B is a latency-oriented Transformer variant built using Delayed Tensor Parallelism (DTP) to overlap tensor-parallel communication with useful computation and weight streaming.
The model is intended to make our Laneformer architecture available on Hugging Face for research, inspection, and fine-tuning. It is also the architecture family used in our public inference-engine preview, where single-request decoding speeds are 3,000 output tokens/s per request on 8x AMD MI300X and 2,100 output tokens/s per request on 8x NVIDIA H200 in FP16 without speculative decoding.
Those figures are Kog Inference Engine (KIE) benchmark results, not expected performance from the generic Transformers runtime.
For background, see:
- Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)
- Delayed Tensor Parallelism for Faster Transformer Inference
Custom code notice: this model architecture is not currently part of upstream Transformers. Loading with AutoConfig, AutoModel, or AutoModelForCausalLM requires trust_remote_code=True. For production or security-sensitive use, review the modeling code and pin a specific Hub commit hash with revision="<commit-hash>".
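One way to enforce the pinning advice above is to refuse anything that is not a full commit hash before the loader is called. The sketch below is illustrative only: `pinned_load_kwargs` is a hypothetical helper, not part of this repository or of Transformers, and it assumes Hub revisions are 40-character hexadecimal Git commit hashes.

```python
import re


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments for a pinned commit.

    Hypothetical helper: rejects branch names such as "main" so the
    remote code cannot change underneath a deployment.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError(
            f"expected a full 40-character commit hash, got {revision!r}"
        )
    return {"trust_remote_code": True, "revision": revision}


if __name__ == "__main__":
    # Placeholder hash for illustration; replace it with a commit you reviewed.
    kwargs = pinned_load_kwargs("0" * 40)
    print(kwargs)
    # With transformers installed, the same kwargs can be forwarded, e.g.
    # AutoConfig.from_pretrained("kogai/laneformer-2b-it", **kwargs)
```

Passing a branch name such as "main" raises ValueError, which makes an unpinned load fail early instead of silently tracking new remote code.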
Why this model exists
Many LLM serving systems optimize aggregate throughput across many concurrent requests. Our Laneformer work targets a different regime: low-batch, single-request decode speed, which is important for agentic coding loops, real-time copilots, voice assistants, and other sequential workflows where each generated step gates the next one.
At batch size 1, autoregressive decoding is often constrained by memory bandwidth, synchronization, kernel launch overheads, and communication latency rather than raw FLOPs. Laneformer uses a lane-structured architecture and delayed communication pattern so tensor-parallel work can be organized around the latency structure of full-node GPU inference.
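To see why memory bandwidth matters at batch size 1, a rough ceiling can be computed by assuming every generated token requires reading all weights once. This is a back-of-the-envelope sketch, not a Kog benchmark: the per-GPU bandwidth below is a hypothetical round number rather than a vendor specification, and the estimate ignores KV-cache reads, synchronization, kernel launches, and communication.

```python
def decode_ceiling_tokens_per_s(
    n_params: float,
    bytes_per_param: int,
    n_gpus: int,
    bandwidth_per_gpu: float,
) -> float:
    """Upper bound on batch-1 decode speed if weight reads were the only cost."""
    weight_bytes = n_params * bytes_per_param
    aggregate_bandwidth = n_gpus * bandwidth_per_gpu
    return aggregate_bandwidth / weight_bytes


# ~2.3B parameters in 16-bit precision, sharded across 8 GPUs.
# 4.0e12 bytes/s per GPU is an assumed, illustrative figure.
ceiling = decode_ceiling_tokens_per_s(2.3e9, 2, 8, 4.0e12)
print(f"Weight-streaming ceiling: {ceiling:,.0f} tokens/s")
```

Real decode speed sits below such a ceiling because the overheads listed above are not free; reducing them is the regime Laneformer targets.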
Architecture overview
Laneformer follows a Llama-style decoder-only architecture, but restructures tensor-parallel communication around lanes.
In standard tensor parallelism, each attention or MLP block typically requires communication before downstream computation can proceed. In Delayed Tensor Parallelism (DTP), local outputs are communicated asynchronously and consumed several modules later, allowing communication latency to be hidden behind subsequent computation and weight streaming.
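The timing effect of delaying consumption can be illustrated with a toy model. The sketch below is not Kog's implementation and models timing only: it treats a decode step as a chain of modules, each with a fixed compute time and a fixed communication time, where module i may start only once the communicated output of module i - delay has arrived. The function name and all durations are hypothetical; delay=1 stands in for blocking tensor parallelism.

```python
def chain_finish_time(n_modules: int, compute: int, comm: int, delay: int) -> int:
    """Time at which the last module's output has been communicated.

    Module i starts after its predecessor's compute has finished and after
    the communicated output of module i - delay has arrived.
    """
    if delay < 1:
        raise ValueError("delay must be at least 1")
    comm_done = []  # arrival time of each module's communicated output
    t = 0           # time at which the previous module's compute finished
    for i in range(n_modules):
        dependency = i - delay
        start = t if dependency < 0 else max(t, comm_done[dependency])
        t = start + compute
        comm_done.append(t + comm)
    return comm_done[-1]


# 15 layers x (attention + MLP) = 30 modules; durations in arbitrary units.
N_MODULES, COMPUTE, COMM = 30, 10, 8
print("blocking (delay=1):", chain_finish_time(N_MODULES, COMPUTE, COMM, 1))
print("delayed  (delay=2):", chain_finish_time(N_MODULES, COMPUTE, COMM, 2))
```

In this toy model, whenever communication takes no longer than one module's compute, a delay of 2 removes every stall and only the final transfer remains exposed.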
| Field | Value |
|---|---|
| Model family | Laneformer |
| HF model type | laneformer |
| Architecture class | LaneformerForCausalLM |
| Task | Decoder-only causal language modeling |
| Parameters | ~2.3B |
| Hidden size | 3072 |
| Intermediate size | 12288 |
| MLP type | SwiGLU |
| Decoder layers | 15 |
| Attention heads | 32 |
| KV heads | 16 |
| Context length | 4096 tokens |
| Sliding window | 2048 tokens |
| Sliding-window layers | 0-9 |
| Full-attention layers | 10-14 |
| Vocabulary size | 32000 |
| Tokenizer | Llama 2 tokenizer |
| Number of lanes | 8 |
| DTP / Broadcast delay | 2 |
| LM head type | vocab_parallel |
| RoPE theta | 10000 |
| Tied word embeddings | false |
| Expected Transformers version | >=4.57.1 |
| Published weight dtype | BFloat16 |
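As a sanity check on the table above, the parameter count can be approximated from the listed dimensions. The sketch assumes standard Llama-style dense projections (Q, K, V, O and a three-matrix SwiGLU MLP), a head dimension of hidden size divided by attention heads, and no contribution from normalization weights; the lane structure may organize these weights differently. The KV-cache estimate further assumes a runtime that keeps only the last 2,048 tokens for sliding-window layers and stores 16-bit values.

```python
def dense_param_count(hidden, intermediate, layers, heads, kv_heads, vocab, tied=False):
    """Approximate parameter count for a Llama-style decoder (norms ignored)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q, O and K, V
    mlp = 3 * hidden * intermediate                        # gate, up, down
    embeddings = vocab * hidden * (1 if tied else 2)       # input embedding (+ LM head)
    return layers * (attention + mlp) + embeddings


def kv_cache_bytes(context, window, sliding_layers, full_layers,
                   kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one sequence if sliding layers keep only the window."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # K and V
    cached_tokens = sliding_layers * min(context, window) + full_layers * context
    return per_token_per_layer * cached_tokens


params = dense_param_count(3072, 12288, 15, 32, 16, 32000, tied=False)
cache = kv_cache_bytes(4096, 2048, 10, 5, kv_heads=16, head_dim=3072 // 32)
print(f"{params:,} parameters (~{params / 1e9:.2f}B)")
print(f"{cache / 2**20:.0f} MiB of KV cache at 4,096 tokens")
```

Under these assumptions the estimate comes to about 2.32B parameters, consistent with the ~2.3B figure in the table.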
The Hugging Face implementation is a reference/compatibility implementation. It is useful for loading, generation, inspection, and downstream experimentation, but it is not the same execution path as our production inference engine, which uses a latency-optimized runtime and optimized low-level GPU kernels.
Installation
NVIDIA
uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu130
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf
AMD
uv venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
uv pip install "transformers>=4.57.1" accelerate safetensors sentencepiece protobuf
Usage
This repository includes chat_template.jinja; use tokenizer.apply_chat_template for chat and instruction prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_id = "kogai/laneformer-2b-it"
revision = "main"  # this model uses remote custom code, pin a reviewed commit hash for reproducible and safer usage
seed = 42
temperature = 0.8

print(f"Using random seed {seed}.")
set_seed(seed)

print(f"Loading tokenizer for {model_id} ({revision=})...")
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
)

print("Loading model. This may take a few minutes on first run while files download...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
    dtype="bfloat16",
    device_map="auto",
)
model.eval()
print(f"Model loaded on {model.device}.")

# The Llama 2 tokenizer has no native pad token. If this repo sets PAD to EOS,
# always pass attention_mask for batched generation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Tokenizer had no pad token; using EOS as the pad token.")

messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]
print("\nPrompt:")
print(messages[0]["content"])

print("\nApplying chat template...")
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
print(f"Prompt token count: {inputs.input_ids.shape[-1]}")

print(f"Generating response with sampling enabled ({temperature=})...")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=temperature,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = output_ids[:, inputs.input_ids.shape[-1]:]
print("\nGenerated response:")
print("-" * 80)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
Tokenizer & Chat template
This repository uses the Llama 2 tokenizer with a 32,000-token vocabulary. The expected special-token convention is:
| Token | ID |
|---|---|
| <unk> | 0 |
| <s> | 1 |
| </s> | 2 |
Instruction-tuned models in this family, such as kogai/laneformer-2b-it, include a repository-level chat_template.jinja.
For the current instruction-tuned template:
- Only user and assistant roles are supported.
- System messages are ignored.
- Each user or assistant message is formatted with a Llama-3-style role header, followed by trimmed message content and </s>.
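The rules above can be sketched as a small rendering function. This is not the repository's template: chat_template.jinja remains the authoritative format. The header string below is an assumption based on the usual Llama 3 convention, raising an error for unsupported roles is an assumption, and any BOS handling is omitted.

```python
# Assumed Llama-3-style role header; the real template may differ.
ROLE_HEADER = "<|start_header_id|>{role}<|end_header_id|>\n\n"
EOS = "</s>"


def render_chat(messages, add_generation_prompt=True):
    """Hypothetical re-implementation of the documented template rules."""
    parts = []
    for message in messages:
        role = message["role"]
        if role == "system":
            continue  # system messages are ignored
        if role not in ("user", "assistant"):
            # Assumption: unsupported roles are rejected rather than skipped.
            raise ValueError(f"unsupported role: {role!r}")
        content = message["content"].strip()  # message content is trimmed
        parts.append(ROLE_HEADER.format(role=role) + content + EOS)
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(ROLE_HEADER.format(role="assistant"))
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "This message is dropped."},
        {"role": "user", "content": "  Who are you?  "},
    ]
    print(render_chat(demo))
```

In practice, tokenizer.apply_chat_template should be used so that prompts match what the model saw during post-training.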
Usage:
messages = [
    {
        "role": "user",
        "content": "Explain how a binary heap works, then write a complete min-heap implementation from scratch in Python.",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
Training data and procedure
Laneformer 2B was trained with a staged language-modeling recipe:
- Broad pre-training to build general language-modeling capabilities.
- Code- and reasoning-heavy continuation training.
- Lightweight post-training to produce the released instruction-tuned model.
The TorchTitan training stages used sequence length 4,096 and global batch size 1,536, or approximately 6.29M tokens per optimizer step.
Training stages
| Stage | Focus | Framework | Tokens | Steps | Approx. tokens / step | Notes |
|---|---|---|---|---|---|---|
| Pre-training | Broad language-model pre-training | TorchTitan | ~4T | 620,000 | ~6.29M | Nemotron-derived broad mixture; final checkpoint selected. |
| Mid-training | Code- and reasoning-focused continuation | TorchTitan | ~2T | 310,000 | ~6.29M | Continued from the pre-training checkpoint with a code/reasoning-heavy mixture. |
| Post-training | SFT, instruction tuning, identity tuning | Hugging Face Transformers | ~210M | 200 | ~1.05M | Lightweight custom data mixture initialized from the mid-training checkpoint. |
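The per-step and per-stage token counts in the table follow from simple arithmetic, reproduced here as a quick check. The snippet uses only figures stated above; the variable names are illustrative.

```python
SEQ_LEN = 4096
GLOBAL_BATCH = 1536

# Tokens consumed by one optimizer step in the TorchTitan stages.
tokens_per_step = SEQ_LEN * GLOBAL_BATCH
print(f"tokens per step: {tokens_per_step:,}")

torchtitan_steps = {"Pre-training": 620_000, "Mid-training": 310_000}
stage_tokens = {name: steps * tokens_per_step for name, steps in torchtitan_steps.items()}
for name, total in stage_tokens.items():
    print(f"{name}: {total / 1e12:.2f}T tokens")

# Post-training: ~210M tokens over 200 steps.
post_tokens_per_step = 210e6 / 200
print(f"Post-training: {post_tokens_per_step / 1e6:.2f}M tokens per step")
```

The products come to about 3.90T and 1.95T tokens, which the table rounds to ~4T and ~2T.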
Data sources
Both TorchTitan stages depend primarily on NVIDIA Nemotron pre-training datasets. Pre-training uses a Nemotron-CC-v2-centered mixture for broad language-model pretraining, with web, synthetic web, QA, math, SFT-style, and code components. Mid-training continues from the pre-training checkpoint on a more code- and reasoning-heavy Nemotron-derived mixture, increasing the emphasis on code metadata, synthetic code data, Common Crawl code pages, math, STEM, and reasoning data.
| Category | Pre-training / Phase 1 | Mid-training / Phase 2 | Notes |
|---|---|---|---|
| General / web crawl | 63.2% | 10.0% | Phase 2 keeps a smaller high-quality and synthetic web component. |
| Code | 21.0% | 59.01% | Large increase from code metadata, synthetic code data, and Common Crawl code pages. |
| Math / STEM / reasoning | 6.3% | 30.01% | Large increase from math, RQA, STEM, and textbook-like data. |
| Multilingual | 5.0% | 0.0% | Removed in the current Phase 2 mixture. |
| QA / instruction / academic | 4.5% | 1.0% | Only a small general SFT-style component remains. |
Training infrastructure
| Field | Value |
|---|---|
| Pre-training framework | TorchTitan |
| Post-training framework | Hugging Face Transformers |
| Training hardware | 24 nodes × 8 NVIDIA H100 GPUs = 192 H100 GPUs |
| Training clusters | Scaleway cluster; ADASTRA cluster |
| Cloud/provider/region | France |
| Training time | ~21 days |
| Precision | FP32/BF16 mixed precision |
| Parallelism | FSDP |
| Optimizer | AdamW |
| Learning-rate schedule | WSD |
| Sequence length during training | 4096 |
| Checkpoint selection | final step |
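For readers unfamiliar with the WSD (warmup-stable-decay) schedule named in the table, the sketch below shows its general shape: a linear warmup, a constant plateau, and a final decay. The decay shape and every hyperparameter here are hypothetical; the values used in training are not stated in this card.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Warmup-stable-decay learning rate with linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr to final_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * progress


# Illustrative values only.
for step in (0, 99, 500, 900, 1000):
    print(step, wsd_lr(step, total_steps=1000, peak_lr=1.0,
                       warmup_steps=100, decay_steps=200))
```

A practical property of this schedule is that the stable phase can be extended or branched into a new decay without restarting, which suits staged training recipes.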
Inference and performance
Public-preview numbers for the Kog Inference Engine running the Laneformer 2B model:
| Setting | Reported speed | Notes |
|---|---|---|
| 8x AMD MI300X | 3,000 output tokens/s/request | FP16, batch size 1, no speculative decoding |
| 8x NVIDIA H200 | 2,100 output tokens/s/request | FP16, batch size 1, no speculative decoding |
These benchmark numbers are for Kog's optimized inference stack, not the plain Hugging Face Transformers implementation in this repository. This preview does not rely on quantization, speculative decoding, pruning, early exit, or KV-cache compression to reach this speed.
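Throughput per request translates directly into per-token latency, which is the quantity that matters for sequential workflows. The conversion below uses only the reported speeds; it covers steady-state decoding and says nothing about prompt processing or time to first token.

```python
def ms_per_token(tokens_per_s: float) -> float:
    """Average time between output tokens, in milliseconds."""
    return 1000.0 / tokens_per_s


reported = {"8x AMD MI300X": 3000, "8x NVIDIA H200": 2100}
for setting, speed in reported.items():
    seconds_for_1024 = 1024 / speed
    print(f"{setting}: {ms_per_token(speed):.3f} ms/token, "
          f"{seconds_for_1024:.2f} s to decode 1,024 tokens")
```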
Evaluation
The following internal Kog evaluations were run in June 2026 against a local Kog serving endpoint for the Laneformer 2B instruction-tuned checkpoint at batch size 1 in FP16. These values evaluate the served preview model; small numerical differences may appear when evaluating the BF16 Hugging Face checkpoint directly through a different runtime.
Code generation
HumanEval+ and MBPP+ were evaluated with greedy decoding. Generation was performed through the Kog serving endpoint, and scoring used EvalPlus.
A custom code-block selection step named target_function was applied before scoring. When possible, this step selects the code block containing the target function name before EvalPlus preprocessing.
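A selection step of this kind might look like the sketch below. It is not the code used for the reported scores: the function name, the regular expressions, and both fallbacks (first block when no block defines the target, whole completion when there are no fenced blocks) are assumptions made for illustration.

```python
import re

FENCE_MARK = "`" * 3  # Markdown code-fence delimiter
FENCE = re.compile(FENCE_MARK + r"[^\n]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def select_code_block(completion: str, target_function: str) -> str:
    """Return the fenced block that defines target_function, if there is one."""
    blocks = FENCE.findall(completion)
    if not blocks:
        return completion  # assumed fallback: nothing fenced, keep everything
    definition = re.compile(rf"\bdef\s+{re.escape(target_function)}\s*\(")
    for block in blocks:
        if definition.search(block):
            return block
    return blocks[0]  # assumed fallback: no block defines the target


if __name__ == "__main__":
    completion = (
        "A helper first:\n"
        + FENCE_MARK + "python\ndef helper():\n    pass\n" + FENCE_MARK
        + "\nAnd the solution:\n"
        + FENCE_MARK + "python\ndef add(a, b):\n    return a + b\n" + FENCE_MARK
    )
    print(select_code_block(completion, "add"))
```

Selecting by function name guards against scoring an example or helper block when a chat model wraps its answer in several fenced snippets.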
| Benchmark | Metric | Value | Samples | Decoding | Scoring | Postprocessing |
|---|---|---|---|---|---|---|
| HumanEval+ | pass@1 | 45.1 | 164 | Greedy, temperature=0, do_sample=False | EvalPlus | target_function block selection |
| MBPP+ | pass@1 | 51.6 | 378 | Greedy, temperature=0, do_sample=False | EvalPlus | target_function block selection |
General multiple-choice checks
ARC-Challenge and ARC-Easy were evaluated with 0-shot multiple-choice logprobs. Orchestration used lm-evaluation-harness 0.4.12 against the Kog serving endpoint.
| Benchmark | Metric | Value | Samples | Setting |
|---|---|---|---|---|
| ARC-Challenge | Normalized accuracy | 31.06 | 1,172 | 0-shot multiple-choice logprobs, lm_eval |
| ARC-Easy | Normalized accuracy | 47.90 | 2,376 | 0-shot multiple-choice logprobs, lm_eval |
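Normalized accuracy in multiple-choice logprob evaluation generally means that each answer's log-probability is divided by the answer's length before the best one is picked, so longer answers are not penalized simply for having more tokens. The sketch below illustrates that idea with made-up numbers; it is a simplified stand-in, not the harness's own code, and normalizing by character length is an assumption about the exact convention.

```python
def pick_choice(logprobs, choices, normalize=True):
    """Index of the highest-scoring choice, optionally length-normalized."""
    scores = [
        lp / len(choice) if normalize else lp
        for lp, choice in zip(logprobs, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical summed log-probabilities for two answer strings.
choices = ["Mars", "The planet Jupiter"]
logprobs = [-4.0, -9.0]
print("raw pick:       ", choices[pick_choice(logprobs, choices, normalize=False)])
print("normalized pick:", choices[pick_choice(logprobs, choices, normalize=True)])
```

In this example the raw score prefers the short answer, while the normalized score prefers the longer one, which shows why the two accuracy variants can differ.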
Long-context synthetic checks
Long-context checks were also run on RULER-style synthetic tasks at 2,048 and 4,096 tokens, using 0-shot greedy chat generation through the LM Evaluation Harness. Values below are string-match scores.
| Task | 2,048 tokens | 4,096 tokens | Effective samples | Setting |
|---|---|---|---|---|
| NIAH single 1 | 100.00 | 100.00 | 500 | using lm_eval |
| NIAH single 2 | 100.00 | 100.00 | 500 | using lm_eval |
| NIAH single 3 | 99.80 | 91.60 | 500 | using lm_eval |
| NIAH multikey 1 | 99.20 | 83.40 | 500 | using lm_eval |
| NIAH multikey 2 | 97.80 | 60.00 | 500 | using lm_eval |
| NIAH multikey 3 | 91.80 | 91.20 | 500 | using lm_eval |
| NIAH multiquery | 95.40 | 69.70 | 500 | using lm_eval |
| NIAH multivalue | 94.60 | 82.20 | 500 | using lm_eval |
| RULER variable tracking | 77.92 | 3.92 | 500 | using lm_eval |
| RULER common words extraction | 24.72 | 78.14 | 500 | using lm_eval |
| RULER frequent words extraction | 78.80 | 64.73 | 500 | using lm_eval |
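A string-match score of the kind reported above can be computed as the share of expected strings found in the generated text. The function below is a minimal sketch of that idea, assuming case-insensitive substring matching; the exact scoring inside the evaluation harness may differ.

```python
def string_match_all(prediction: str, references: list) -> float:
    """Percentage of reference strings that occur in the prediction."""
    text = prediction.lower()
    hits = sum(1 for reference in references if reference.lower() in text)
    return 100.0 * hits / len(references)


# Hypothetical needle-in-a-haystack answer with two expected values.
prediction = "The magic numbers are 4821 and 1177."
print(string_match_all(prediction, ["4821", "9930"]))
```

Averaging this per-sample score over the 500 samples of a task gives a value on the same 0 to 100 scale as the table.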
Intended use
This model is intended for:
- Research and experimentation with latency-oriented Transformer architectures.
- Evaluation of Laneformer / Delayed Tensor Parallelism design choices.
- Causal language modeling and code-generation experiments.
- Fine-tuning experiments, subject to all applicable license terms.
- Inference-system and attention-backend testing.
This model is not automatically suitable for high-stakes settings such as medical, legal, financial, employment, education, public-sector, law-enforcement, or safety-critical decision-making.
Out-of-scope use
Do not use this model or tokenizer in ways that violate the repository licenses, the Llama 2 Community License, the Llama 2 Acceptable Use Policy, applicable law, privacy rights, or safety policies.
Do not present generated outputs as factual without independent verification. Do not use the model as the sole decision-maker in high-stakes workflows.
Limitations and risks
- This is a 2B-class model and is not a frontier general-purpose assistant.
- The model is primarily mid-trained and post-trained for code generation: it is not optimized as a broad general-purpose assistant.
- The model may generate incorrect, insecure, biased, toxic, or misleading content.
- The model may hallucinate facts, APIs, package names, citations, and code behavior.
- The base model may not follow instructions reliably unless it has been instruction-tuned.
- The public HF implementation prioritizes compatibility and inspection, not the full Kog Inference Engine latency path.
- Performance may depend on the exact PyTorch, Transformers, device, dtype, and attention backend used.
- Because this repo uses custom code, downstream users should review code before execution and pin a revision in production.
License
This repository is multi-license.
Kog-owned materials: Apache License 2.0
Unless a file states otherwise, the model weights, Hugging Face custom modeling and configuration code, model configuration files, metadata files, documentation, and README/model card are released under the Apache License 2.0.
See LICENSE for the full Apache License 2.0 text.
Tokenizer materials: Llama 2 Community License
The tokenizer files are based on the Llama 2 tokenizer and are not licensed under Apache License 2.0. These tokenizer materials include, without limitation:
- tokenizer.model
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- any other file whose purpose is to reproduce or configure the Llama 2 tokenizer
These tokenizer materials are distributed under the LLAMA 2 Community License Agreement. See THIRD_PARTY_LICENSES/LLAMA2_LICENSE and NOTICE.
Required Llama 2 attribution notice:
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Users and redistributors are responsible for complying with the Llama 2 Community License and its Acceptable Use Policy.
Citation
If you use this model or architecture, please cite the model repository and the relevant Kog technical posts:
@misc{kog_laneformer_2b_it_2026,
title = {Kog Laneformer 2B Instruct Model},
author = {Kog Team},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/kogai/laneformer-2b-it}}
}
@online{kog_laneformer_2b_2026,
title = {{Laneformer 2B: The Latency-First Model Behind Kog Inference Engine}},
author = {Kog Team},
year = {2026},
url = {https://huggingface.co/blog/kogai/kog-laneformer-2b-the-latency-first-model},
note = {Hugging Face blog}
}
@misc{kog_real_time_llm_inference_2026,
title = {Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)},
author = {Kog Team},
year = {2026},
howpublished = {\url{https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/}}
}
@misc{kog_delayed_tensor_parallelism_2026,
title = {Delayed Tensor Parallelism for Faster Transformer Inference},
author = {Kog Team},
year = {2026},
howpublished = {\url{https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference/}}
}
Contact
For questions about this model, open a discussion or issue on the Hugging Face repository, or contact Kog AI through the channels listed on kog.ai.
Datasets used to train kogai/laneformer-2b-it
nvidia/Nemotron-Pretraining-Code-v2
nvidia/Nemotron-CC-v2