ConvGPT 164M SYNTH EC 250B TOKENS
This is an Early Checkpoint (EC) of the ConvGPT architecture, a novel model designed for maximal hidden size compression.
Model Details
- Architecture: ConvGPT
- Checkpoint Step: 172,000
- Parameters: 163,952,769
- Num layers: 32
- Hidden size: 1296
- Transformer dimension: 144
- Vocab size: 65538
- Intermediate size: 3072
- Num attention heads: 16
- Num kv heads: 8
- Head dim: 128
- Tie word embeddings: True
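The figures above can be cross-checked with a little arithmetic. The sketch below is only a rough sanity check: it assumes the tied embedding matrix has shape (vocab size, hidden size), which the list suggests but does not state outright, and every variable name is illustrative.

```python
# Values copied from the Model Details list.
TOTAL_PARAMS = 163_952_769
VOCAB_SIZE = 65_538
HIDDEN_SIZE = 1_296
TRANSFORMER_DIM = 144
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Assumption: the tied embedding matrix has shape (vocab_size, hidden_size),
# so it is counted once for both the input embedding and the prediction head.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS

# Ratio between the embedding width and the width used inside the decoder.
compression_ratio = HIDDEN_SIZE // TRANSFORMER_DIM

# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS

# Total width of the query projection output (heads * head_dim).
query_width = NUM_ATTENTION_HEADS * HEAD_DIM

print(f"embedding parameters: {embedding_params:,}")
print(f"share of all parameters: {embedding_share:.1%}")
print(f"compression ratio: {compression_ratio}x")
print(f"query heads per kv head: {queries_per_kv_head}")
print(f"query projection width: {query_width}")
```

Under that assumption, roughly half of the parameters sit in the tied embedding matrix, and the rest are spread across the 32 decoder layers and the convolutional compression layers.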
Architecture Highlights
ConvGPT introduces a novel approach to Large Language Model compression by integrating 2D convolutional networks directly into the pre-training architecture, rather than relying on post-training quantization or pruning. Designed specifically for Mobile/Edge (SLM) use cases, it achieves significant parameter reduction while maintaining high reasoning capabilities.
- Convolutional Embedding Compression: Unlike standard Transformers that maintain a constant hidden size throughout, ConvGPT uses a Conv2D + Average Pooling layer to compress the input hidden state vector by a factor of 9 (from 1296 to 144 dimensions in this checkpoint) before it enters the residual stream. This allows the model to maintain high-dimensional information in the embedding layer and prediction head while operating on a much smaller, more efficient vector in the decoder layers.
- Causal masking in 2D: The architecture implements specialized padding and reshaping mechanisms during the convolution steps to strictly preserve autoregressive causality. This eliminates "token leakage" (look-ahead bias), so the model remains robust during generation and avoids the test-time degradation often seen in naive convolutional language models.
- Extreme Parameter Efficiency:
  - Current Model: 164M parameters (comparable performance to a standard 722M parameter architecture), a ~4.4x size reduction.
  - Scaling Potential: The architecture scales efficiently; a configuration with hidden_size=2048 results in just 266M parameters compared to a 1.7B parameter baseline (a 6.5x reduction).
- Performance-to-Size Ratio: Trained on 250B tokens (PleIAs/SYNTH), this 164M model achieves >30% on GPQA-Diamond, a significant outlier for its size class, demonstrating that logic and reasoning capabilities can be preserved even with aggressive vector compression.
- Normalization Stability: Includes post-convolution normalization to manage vector value scaling, ensuring training stability and consistent generation output.
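To make the 9x compression concrete, here is a minimal pure-Python sketch of the pooling half of that step. It is not the code from modeling_convgpt.py: the 36x36 grid, the 3x3 non-overlapping window, and all function names are assumptions chosen so that the 1296-wide hidden size maps to the 144-wide transformer dimension, and the learned Conv2D and the normalization are left out entirely.

```python
# Illustrative sketch only. The real layer also applies a learned Conv2D and
# post-convolution normalization; the grid and window sizes are assumptions.
HIDDEN_SIZE = 1296     # from the model card
TRANSFORMER_DIM = 144  # from the model card
GRID = 36              # assumed: 36 * 36 == 1296
POOL = 3               # assumed: a 3x3 window gives 9x compression


def to_grid(vector, side):
    """Reshape a flat vector into a side x side grid (row-major order)."""
    if len(vector) != side * side:
        raise ValueError("vector length must equal side * side")
    return [vector[r * side:(r + 1) * side] for r in range(side)]


def avg_pool2d(grid, window):
    """Non-overlapping average pooling with a square window."""
    side = len(grid)
    if side % window != 0:
        raise ValueError("grid side must be divisible by the window size")
    out_side = side // window
    pooled = []
    for r in range(out_side):
        row = []
        for c in range(out_side):
            total = 0.0
            for dr in range(window):
                for dc in range(window):
                    total += grid[r * window + dr][c * window + dc]
            row.append(total / (window * window))
        pooled.append(row)
    return pooled


def compress(vector):
    """Pool the grid and flatten it back into a vector for the decoder."""
    pooled = avg_pool2d(to_grid(vector, GRID), POOL)
    return [value for row in pooled for value in row]


hidden = [float(i) for i in range(HIDDEN_SIZE)]
compressed = compress(hidden)
print(len(hidden), "->", len(compressed))  # prints: 1296 -> 144
```

Each output value is the mean of nine input values, so the pooling step alone is lossy; in the actual architecture the convolution ahead of it is presumably what learns which information to keep.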
Training Details
This model is currently being trained using the Google TPU Research Cloud (TRC).
- Dataset: PleIAs/SYNTH
- Tokens Processed: ~250 Billion
- Hardware: TPUv4-16
- Training Time: ~30 Days
- Effective Batch Size: 512
- Context Length: 4096 tokens
- Learning rate: P1: 1e-3 (75B), P2: 1e-4 (175B)
- Weight decay: P1: 0.0, P2: 0.01
- Optimizer: AdamW
- Precision: BFloat16
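The two-phase learning rate and weight decay settings can be read as a piecewise schedule over tokens seen. The sketch below assumes the figures in parentheses are the number of tokens spent in each phase (75B + 175B = 250B); the names PHASES and hyperparams_at are hypothetical, and warmup or decay within a phase is omitted because the list does not describe any.

```python
# Phase values are taken from the Training Details list. Names are
# hypothetical, and no warmup or decay is modeled.
PHASES = [
    # (tokens in phase, learning rate, weight decay)
    (75_000_000_000, 1e-3, 0.0),    # P1
    (175_000_000_000, 1e-4, 0.01),  # P2
]


def hyperparams_at(tokens_seen):
    """Return (learning_rate, weight_decay) for a given token count."""
    if tokens_seen < 0:
        raise ValueError("tokens_seen must be non-negative")
    boundary = 0
    for phase_tokens, learning_rate, weight_decay in PHASES:
        boundary += phase_tokens
        if tokens_seen < boundary:
            return learning_rate, weight_decay
    # Past the end of the listed schedule: keep the last phase's values.
    return PHASES[-1][1], PHASES[-1][2]


total_tokens = sum(phase[0] for phase in PHASES)
print(f"total tokens: {total_tokens:,}")
print("at 10B tokens:", hyperparams_at(10_000_000_000))
print("at 100B tokens:", hyperparams_at(100_000_000_000))
```

Keying the schedule on tokens rather than optimizer steps keeps it independent of batch size and sequence packing, neither of which this sketch needs to know.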
Usage
Note: You must use trust_remote_code=True as this model utilizes custom modeling code (modeling_convgpt.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "mkurman/ConvGPT-0.2B-SYNTH-250B-EC"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model with custom code trust
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
).eval()

# Extra keyword arguments are forwarded to tokenizer.decode(),
# so special tokens stay visible in the streamed output.
streamer = TextStreamer(tokenizer, skip_prompt=False, skip_special_tokens=False)

# Prepare input
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is hypertension?"}],
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
)
print(f"Input IDs: {input_ids}")

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=128,
        streamer=streamer,
        use_cache=True,
        # Important: Keep repetition_penalty at 1.0 for this early checkpoint
        repetition_penalty=1.0,
    )
You can also find support for vLLM and SGLang in my GitHub repository.
Acknowledgments
This model was trained using Cloud TPUs provided by Google's TPU Research Cloud (TRC) program.
Special thanks to Pierre-Carl Langlais and the PleIAs team for the high-quality SYNTH dataset.
Repo