ConvGPT-v2 B200 Full SYNTH 20h

mkurman/convgpt-v2-b200-full-synth-20h is an experimental ConvGPT-v2 language model checkpoint trained on synthetic reasoning/chat data. Intermediate checkpoints are available from roughly checkpoint-47000 through checkpoint-61000, and Model Details below lists checkpoint-134000 as the latest checkpoint.

This model is primarily a research artifact for testing convolution-first language modeling at scale. It is not a standard Transformer: ConvGPT-v2 replaces dense self-attention blocks with a hybrid causal 1D/2D convolutional architecture plus sparse chunk-token retrieval memory.

Reproduction - important!

Use the tokenizer from the tokenizer directory, as the one in the checkpoint files is incorrect.

Model Details

Architecture: ConvGPT-v2 custom causal language model
Latest checkpoint: checkpoint-134000
Training run: B200 (180 GB) full-SYNTH 20 h run, followed by 72 h of local training (RTX 3090 Ti, sequence length 1024, effective batch size 64, learning rate 1e-4)
Training data: PleIAs/SYNTH-style synthetic reasoning/chat examples
Approx. tokens seen: 31.98B (reported at checkpoint-61000)
Reported CE loss: ~1.753 nats/token (at checkpoint-61000)
Approx. perplexity: exp(1.753) ≈ 5.77
Vocab size: 32,024
Training sequence length: 512 → 2048
Configured max position capacity: 65,536 tokens via 256×256 grid
Precision: BF16 training
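
To make the position-capacity figure concrete: a 256×256 grid has 65,536 cells, one per token position. The sketch below uses the textbook Hilbert-curve index-to-coordinate conversion to show how a 1D position could be packed into such a grid so that consecutive tokens land in neighboring cells. The function name `hilbert_d2xy` and the curve orientation are illustrative assumptions; the packing actually implemented in modeling_convgpt_v2.py may differ.

```python
GRID_SIDE = 256  # grid side from the model card; must be a power of two


def hilbert_d2xy(side, d):
    """Map a 1D index d in [0, side * side) to (x, y) along a Hilbert curve.

    Standard iterative conversion; the orientation is an assumption and is
    not taken from the ConvGPT-v2 source.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves connect end to end
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


capacity = GRID_SIDE * GRID_SIDE
coords = [hilbert_d2xy(GRID_SIDE, d) for d in range(capacity)]

# Largest grid distance (Manhattan) between any two consecutive positions
max_step = max(
    abs(x1 - x0) + abs(y1 - y0)
    for (x0, y0), (x1, y1) in zip(coords, coords[1:])
)

print("capacity:", capacity)
print("distinct cells used:", len(set(coords)))
print("max distance between consecutive tokens:", max_step)
```

Because neighbors in the sequence stay neighbors on the grid, a small 2D kernel sees a mix of recent tokens and tokens from much earlier in the sequence.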

Architecture Summary

ConvGPT-v2 is a dense-self-attention-free experimental LM. It uses:

causal 1D convolution branch
causal 2D convolution branch over a 256×256 Hilbert-packed token grid
gated fusion between 1D and 2D branches
RoPE/no-position hybrid configuration
sparse chunk_token_memory retrieval every 2 layers
custom Triton kernels for active causal 2D gathering and sparse retrieval paths
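
As a rough illustration of two of the components listed above, the following pure-Python sketch shows a causal 1D convolution (each output depends only on the current and earlier inputs) and an elementwise gated blend of two branch outputs. The function names, the sigmoid gate, and the convex-blend form are assumptions made for illustration; the card does not specify how ConvGPT-v2 parameterizes its gate.

```python
import math


def causal_conv1d(x, kernel):
    """y[t] = sum_j kernel[j] * x[t - j], treating x[t - j] as 0 when t - j < 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            if t - j >= 0:  # never look at future positions
                acc += w * x[t - j]
        out.append(acc)
    return out


def gated_fusion(branch_a, branch_b, gate_logits):
    """Blend two branch outputs elementwise with a sigmoid gate (assumed form)."""
    fused = []
    for a, b, g in zip(branch_a, branch_b, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))
        fused.append(gate * a + (1.0 - gate) * b)
    return fused


signal = [1.0, 2.0, 3.0, 4.0]
smoothed = causal_conv1d(signal, [0.5, 0.5])

# Changing a later input must not change earlier outputs
perturbed = causal_conv1d([1.0, 2.0, 3.0, 99.0], [0.5, 0.5])

print(smoothed)
print(perturbed[:3] == smoothed[:3])
print(gated_fusion([1.0], [3.0], [0.0]))
```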

Important distinction: the model is free of standard dense Transformer self-attention, but it is not strictly “attention-free” because the chunk_token_memory router performs sparse attention-like retrieval over selected prior chunks/tokens.
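
The "attention-like" character of sparse retrieval can be sketched generically: score chunks of prior tokens, keep the top few, then take a softmax-weighted average over only the tokens in those chunks. This is a minimal sketch under stated assumptions, not the chunk_token_memory implementation. The mean-key chunk score, the `chunk_size` and `top_k` parameters, and the function name are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_chunk_retrieval(query, keys, values, chunk_size=4, top_k=1):
    """Attend from one query to tokens in the top_k best-matching prior chunks.

    keys/values hold prior positions only, so the retrieval is causal.
    """
    # Split prior positions into fixed-size chunks
    chunks = [
        list(range(start, min(start + chunk_size, len(keys))))
        for start in range(0, len(keys), chunk_size)
    ]

    # Score each chunk by similarity between the query and the chunk's mean key
    def chunk_score(indices):
        mean_key = [
            sum(keys[i][d] for i in indices) / len(indices)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    ranked = sorted(range(len(chunks)), key=lambda c: chunk_score(chunks[c]), reverse=True)
    selected = sorted(i for c in ranked[:top_k] for i in chunks[c])

    # Softmax over the selected tokens only; all other tokens get zero weight
    scores = [dot(query, keys[i]) / math.sqrt(len(query)) for i in selected]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    output = [
        sum(w / total * values[i][d] for w, i in zip(weights, selected))
        for d in range(len(values[0]))
    ]
    return selected, output


# Eight prior tokens: the first chunk points along axis 0, the second along axis 1
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[float(i)] for i in range(8)]
selected, output = sparse_chunk_retrieval([0.0, 1.0], keys, values)
print(selected, output)
```

The cost scales with `top_k * chunk_size` rather than with the full context length, which is the usual motivation for this kind of routing.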

Repository Contents

This repository includes:

checkpoint-61000/ — latest model checkpoint
earlier checkpoints from approximately checkpoint-47000 to checkpoint-61000
ConvGPT-v2 source files:
- modeling_convgpt_v2.py
- configuration_convgpt_v2.py
- registration.py
- __init__.py
training scripts:
- train_convgpt_v2_2d_pleias_long.py
- run_convgpt_v2_b200_full_synth_20h.sh

Intended Use

This model is intended for:

research into convolutional / dense-attention-free language model architectures
experiments with sparse retrieval memory as an alternative to Transformer self-attention
studying scaling behavior of ConvGPT-v2 on synthetic reasoning/chat data
checkpoint analysis, BPB/perplexity evaluation, and generation experiments
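
For BPB (bits per byte) evaluation, the usual conversion divides the total negative log-likelihood, converted from nats to bits, by the UTF-8 byte length of the evaluated text. The helper below is a minimal sketch. The function name and the example token count are hypothetical, and the summed loss would come from an actual evaluation run.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed token-level NLL (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


# Hypothetical example: 3 tokens at 1.753 nats each covering an 11-byte string
sample_text = "hello world"
sample_nll = 3 * 1.753
bpb = bits_per_byte(sample_nll, sample_text)
print(f"{bpb:.4f} bits/byte")
```

Unlike per-token perplexity, BPB does not depend on the tokenizer's vocabulary, so it is the safer number when comparing against models with different tokenizers.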

It is not intended for production use, medical advice, legal advice, or safety-critical applications.

Usage

Because this is a custom architecture, load with trusted remote code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

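repo = "mkurman/convgpt-v2-b200-full-synth-20h"
checkpoint = "checkpoint-134000"  # checkpoint subfolder inside the repo

# Load the tokenizer from the repo's tokenizer directory, because the copy
# stored with the checkpoint files is incorrect (see "Reproduction" above).
# Assumes the directory is named "tokenizer" and sits at the repo root.
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    subfolder=checkpoint,  # a Hub repo id cannot include the checkpoint path
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()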

messages = [{"role": "user", "content": "Explain hypertension briefly."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))

Training Notes

The model was trained using a custom ConvGPT-v2 trainer on synthetic reasoning data. The uploaded shell and Python scripts document the exact training setup, including:

B200-oriented BF16 training
full PleIAs/SYNTH streaming mode
causal full-token loss
checkpoint upload support to Hugging Face Hub
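
One reading of "causal full-token loss" is next-token cross entropy averaged over every position in the sequence, with no masking of prompt or user tokens. That interpretation is an assumption; the uploaded training scripts are the authoritative reference. A minimal pure-Python sketch of that loss:

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def causal_full_token_loss(logits, token_ids):
    """Mean next-token cross entropy in nats.

    logits[t] is the prediction made at position t and is scored against
    token_ids[t + 1]; every target position contributes (no masking).
    """
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


# Uniform logits over a 4-token vocabulary give a loss of ln(4)
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = causal_full_token_loss(uniform_logits, [2, 0, 3])
print(loss, math.log(4))
```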

The latest run reached approximately:

checkpoint: checkpoint-61000
cross entropy: ~1.753 nats/token
tokens seen: ~31,981,568,000
per-token perplexity: ~5.77
bits/token: ~2.53
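
The figures above are related by simple arithmetic, which the snippet below reproduces: perplexity is exp(CE), and bits/token is CE divided by ln 2. It also divides tokens seen by the checkpoint number to estimate tokens per optimizer step. That last step assumes the checkpoint number equals the optimizer step count and that the batch size was constant, neither of which the card states.

```python
import math

ce_nats = 1.753                # reported cross entropy, nats/token
tokens_seen = 31_981_568_000   # reported tokens seen
steps = 61_000                 # checkpoint number, assumed to be the step count

perplexity = math.exp(ce_nats)
bits_per_token = ce_nats / math.log(2)
tokens_per_step, remainder = divmod(tokens_seen, steps)

print(f"perplexity      ~ {perplexity:.2f}")
print(f"bits/token      ~ {bits_per_token:.2f}")
print(f"tokens per step = {tokens_per_step} (remainder {remainder})")
```

Under those assumptions the division is exact and gives 524,288 = 2^19 tokens per step, which is consistent with, for example, 256 sequences of 2,048 tokens; the actual batch shape is not given in this card.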

Limitations

This is an experimental research checkpoint. Known limitations:

generation quality is not yet comparable to mature Transformer LMs
may hallucinate or produce malformed reasoning
trained mostly on synthetic data (PleIAs/SYNTH), so distributional coverage is largely limited to that dataset
not instruction-safety tuned
custom architecture requires trust_remote_code=True
sparse retrieval memory is attention-like, so the model should not be described as strictly attention-free

Citation / Attribution

If you use this model, please refer to it as:

ConvGPT-v2 B200 Full SYNTH 20h — an experimental dense-self-attention-free convolutional sparse-retrieval language model by Mariusz Kurman.
