How to use mkurman/ConvGPT-0.2B-SYNTH-250B-EC with libraries and local apps.

Libraries

How to use mkurman/ConvGPT-0.2B-SYNTH-250B-EC with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mkurman/ConvGPT-0.2B-SYNTH-250B-EC", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mkurman/ConvGPT-0.2B-SYNTH-250B-EC", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use mkurman/ConvGPT-0.2B-SYNTH-250B-EC with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server (this command keeps running in the foreground):
vllm serve "mkurman/ConvGPT-0.2B-SYNTH-250B-EC"
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mkurman/ConvGPT-0.2B-SYNTH-250B-EC",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
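The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are illustrative, and the default base URL assumes the vLLM server from the previous step is listening on port 8000 (use port 30000 for the SGLang examples).

```python
import json
import urllib.request

MODEL_ID = "mkurman/ConvGPT-0.2B-SYNTH-250B-EC"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(ask("What is the capital of France?"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.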

Use Docker

```
docker model run hf.co/mkurman/ConvGPT-0.2B-SYNTH-250B-EC
```

SGLang

How to use mkurman/ConvGPT-0.2B-SYNTH-250B-EC with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server (this command keeps running in the foreground):
python3 -m sglang.launch_server \
    --model-path "mkurman/ConvGPT-0.2B-SYNTH-250B-EC" \
    --host 0.0.0.0 \
    --port 30000
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mkurman/ConvGPT-0.2B-SYNTH-250B-EC",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
# Start the SGLang server in a container (this command keeps running in the foreground):
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mkurman/ConvGPT-0.2B-SYNTH-250B-EC" \
        --host 0.0.0.0 \
        --port 30000
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mkurman/ConvGPT-0.2B-SYNTH-250B-EC",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use mkurman/ConvGPT-0.2B-SYNTH-250B-EC with Docker Model Runner:
```
docker model run hf.co/mkurman/ConvGPT-0.2B-SYNTH-250B-EC
```

ConvGPT 164M SYNTH EC 250B TOKENS

⚠️ EXPERIMENTAL EARLY CHECKPOINT ⚠️

This is an Early Checkpoint (EC) of the ConvGPT architecture, a novel model designed for maximal hidden size compression.

Model Details

Architecture: ConvGPT
Checkpoint Step: 172,000
Parameters: 163,952,769
Num layers: 32
Hidden size: 1296
Transformer dimension: 144
Vocab size: 65538
Intermediate size: 3072
Num attention heads: 16
Num kv heads: 8
Head dim: 128
Tie word embeddings: True
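A few quantities follow directly from the numbers listed above. The sketch below only does arithmetic on those values. It assumes the token embedding matrix has shape vocab size x hidden size and, because the embeddings are tied, counts it once; the exact layer shapes are defined in the model's own code, not here.

```python
# Values copied from the Model Details list above.
hidden_size = 1296
transformer_dim = 144
vocab_size = 65538
total_params = 163_952_769
num_attention_heads = 16
num_kv_heads = 8

# Ratio between the embedding width and the width used inside the decoder.
compression = hidden_size // transformer_dim

# Assumption: one (vocab_size x hidden_size) matrix shared by the input
# embedding and the output head (tied word embeddings).
embedding_params = vocab_size * hidden_size
embedding_share = embedding_params / total_params

# Grouped-query attention: query heads that share each key/value head.
queries_per_kv_head = num_attention_heads // num_kv_heads

print(f"compression factor: {compression}x")
print(f"embedding parameters: {embedding_params:,} ({embedding_share:.1%} of total)")
print(f"query heads per KV head: {queries_per_kv_head}")
```

Under that assumption, roughly half of the parameter budget sits in the tied embedding matrix, and the rest is spread over the 32 decoder layers and the compression layers.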

Architecture Highlights

ConvGPT approaches language model compression by building 2D convolutional layers directly into the pre-training architecture, rather than relying on post-training quantization or pruning. It targets mobile and edge small language model (SLM) use cases, where it aims to cut the parameter count substantially while preserving reasoning ability.

Convolutional Embedding Compression: Unlike standard Transformers, which keep a constant hidden size throughout, ConvGPT uses a Conv2D + Average Pooling layer to compress each input hidden state vector by a factor of 9 (from 1296 to 144 dimensions) before it enters the residual stream. The embedding layer and prediction head keep the full high-dimensional representation, while the decoder layers operate on the much smaller vector.
Causal masking in 2D: The convolution steps use specialized padding and reshaping to strictly preserve autoregressive causality. This eliminates "token leakage" (look-ahead bias), which keeps generation robust and prevents the test-time degradation often seen in naive convolutional language models.
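To make the two ideas above concrete, here is a toy illustration in plain Python. It is not the model's actual layer (that lives in modeling_convgpt.py). It assumes the 1296-dimensional vector is viewed as a 36 x 36 grid and pooled with a non-overlapping 3 x 3 window, which is one layout that yields 144 values; the learned Conv2D step is left out, and the causal convolution over tokens is a generic left-padding example. All function names are hypothetical.

```python
def avg_pool_2d(grid, k):
    """Non-overlapping k x k average pooling over a square grid (list of rows)."""
    n = len(grid)
    out = []
    for r in range(0, n, k):
        row = []
        for c in range(0, n, k):
            window = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(sum(window) / (k * k))
        out.append(row)
    return out


def compress(vector, side=36, k=3):
    """View a side*side vector as a grid, pool it, and flatten the result."""
    grid = [vector[r * side:(r + 1) * side] for r in range(side)]
    pooled = avg_pool_2d(grid, k)
    return [x for row in pooled for x in row]


def causal_conv_over_tokens(seq, kernel):
    """Mix each token with earlier tokens only, using left-only zero padding."""
    pad = len(kernel) - 1
    dim = len(seq[0])
    padded = [[0.0] * dim for _ in range(pad)] + seq
    out = []
    for t in range(len(seq)):
        # padded[t : t + len(kernel)] covers original positions t - pad .. t,
        # so the last kernel weight applies to the current token.
        window = padded[t:t + len(kernel)]
        out.append([sum(w * tok[d] for w, tok in zip(kernel, window))
                    for d in range(dim)])
    return out


hidden = [float(i) for i in range(1296)]
print(len(compress(hidden)))  # 1296 values in, 144 values out
```

Because the padding is added only on the left, changing a later token cannot alter the output at an earlier position, which is the property the causal masking bullet describes.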

Extreme Parameter Efficiency:

Current Model: 164M parameters, with performance comparable to a standard 722M-parameter architecture (a ~4.4x size reduction).
Scaling Potential: The architecture scales efficiently; a configuration with hidden_size=2048 yields just 266M parameters, versus a 1.7B-parameter baseline (a 6.5x reduction).
Performance-to-Size Ratio: Trained on 250B tokens (PleIAs/SYNTH), this 164M model achieves >30% on GPQA-Diamond, a significant outlier for its size class, demonstrating that logic and reasoning capabilities can be preserved even with aggressive vector compression.
Normalization Stability: Includes post-convolution normalization to manage vector value scaling, ensuring training stability and consistent generation output.
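The size reductions quoted above are simple ratios of baseline to compressed parameter counts. A quick check using the rounded figures from the text (the helper name is illustrative):

```python
def reduction_factor(baseline_params, compressed_params):
    """How many times smaller the compressed model is than the baseline."""
    return baseline_params / compressed_params


current = reduction_factor(722e6, 164e6)  # current model vs. standard architecture
scaled = reduction_factor(1.7e9, 266e6)   # hidden_size=2048 configuration

print(f"current model: {current:.1f}x")
print(f"hidden_size=2048: {scaled:.1f}x")
```

With these rounded inputs the second ratio comes out near 6.4, close to the quoted 6.5x; the small gap is presumably due to rounding in the 1.7B baseline figure.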

Training Details

This model is currently being trained using the Google TPU Research Cloud (TRC).

Dataset: PleIAs/SYNTH
Tokens Processed: ~250 Billion
Hardware: TPUv4-16
Training Time: ~30 Days
Effective Batch Size: 512
Context Length: 4096 tokens
Learning rate: Phase 1 (P1): 1e-3 for 75B tokens; Phase 2 (P2): 1e-4 for 175B tokens
Weight decay: P1: 0.0, P2: 0.01
Optimizer: AdamW
Precision: BFloat16
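The two-phase settings above can be written as a lookup on the number of tokens processed. This is a sketch that assumes P1 and P2 are consecutive phases measured in training tokens (75B + 175B = 250B) with a hard switch between them; any warmup or decay inside a phase is not described here and is therefore not modeled. The function name is hypothetical.

```python
PHASE_1_TOKENS = 75e9   # P1 length in tokens
PHASE_2_TOKENS = 175e9  # P2 length in tokens
TOTAL_TOKENS = PHASE_1_TOKENS + PHASE_2_TOKENS  # 250B


def hyperparams_at(tokens_seen):
    """Return the AdamW learning rate and weight decay for a given token count."""
    if tokens_seen < PHASE_1_TOKENS:
        return {"lr": 1e-3, "weight_decay": 0.0}
    return {"lr": 1e-4, "weight_decay": 0.01}


for tokens in (0, 74e9, 75e9, 250e9):
    print(f"{tokens / 1e9:.0f}B tokens -> {hyperparams_at(tokens)}")
```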

Usage

Note: You must use trust_remote_code=True as this model utilizes custom modeling code (modeling_convgpt.py).

```python
import torch
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM

model_id = "mkurman/ConvGPT-0.2B-SYNTH-250B-EC"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model with custom code trust
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='cuda',
    trust_remote_code=True
).eval()

# Stream tokens to stdout as they are generated; extra keyword arguments
# are forwarded to tokenizer.decode, so special tokens stay visible.
streamer = TextStreamer(tokenizer, skip_prompt=False, skip_special_tokens=False)

# Prepare input
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is hypertension?"}],
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True
)

print(f"Input IDs: {input_ids}")

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=128,
        streamer=streamer,
        use_cache=True,
        # Important: Keep repetition_penalty at 1.0 for this early checkpoint
        repetition_penalty=1.0,
    )
```

You can also find support for vLLM and SGLang in my GitHub repository.

Acknowledgments

This model was trained using Cloud TPUs provided by Google's TPU Research Cloud (TRC) program.

Special thanks to Pierre-Carl Langlais and the PleIAs team for the high-quality SYNTH dataset.

Repo

GitHub: https://github.com/mkurman/convgpt
