ConvGPT-v2 B200 Full SYNTH 20h

mkurman/convgpt-v2-b200-full-synth-20h is an experimental ConvGPT-v2 language model checkpoint trained on synthetic reasoning/chat data. The latest uploaded checkpoint is checkpoint-61000, with intermediate checkpoints available from roughly checkpoint-47000 through checkpoint-61000.

This model is primarily a research artifact for testing convolution-first language modeling at scale. It is not a standard Transformer: ConvGPT-v2 replaces dense self-attention blocks with a hybrid causal 1D/2D convolutional architecture plus sparse chunk-token retrieval memory.

Reproduction - important!

Load the tokenizer from the repository's tokenizer directory; the tokenizer files stored alongside the checkpoint files are incorrect.

Model Details

  • Architecture: ConvGPT-v2 custom causal language model
  • Latest checkpoint: checkpoint-134000
  • Training run: 20 h full-SYNTH run on a B200 (180 GB), followed by 72 h of local training (RTX 3090 Ti, sequence length 1024, effective batch size 64, learning rate 1e-4)
  • Training data: PleIAs/SYNTH-style synthetic reasoning/chat examples
  • Approx. tokens seen: ~31.98B tokens
  • Reported CE loss: ~1.753 nats/token
  • Approx. perplexity: exp(1.753) ≈ 5.77
  • Vocab size: 32,024
  • Training sequence length: 512 → 2048 tokens
  • Configured max position capacity: 65,536 tokens via 256×256 grid
  • Precision: BF16 training
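
The loss figures in the list above are related by simple conversions, and the position capacity follows from the grid size. A minimal sketch of that arithmetic, using only values from the list (the variable names are illustrative):

```python
import math

ce_nats = 1.753   # reported cross entropy, nats/token
grid_side = 256   # side length of the 2D token grid

perplexity = math.exp(ce_nats)           # per-token perplexity
bits_per_token = ce_nats / math.log(2)   # convert nats to bits
max_positions = grid_side * grid_side    # capacity of the 256x256 grid

print(f"perplexity ~ {perplexity:.2f}")
print(f"bits/token ~ {bits_per_token:.2f}")
print(f"max positions = {max_positions}")
```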

Architecture Summary

ConvGPT-v2 is an experimental LM that does not use dense self-attention. It uses:

  • causal 1D convolution branch
  • causal 2D convolution branch over a 256×256 Hilbert-packed token grid
  • gated fusion between 1D and 2D branches
  • RoPE/no-position hybrid configuration
  • sparse chunk_token_memory retrieval every 2 layers
  • custom Triton kernels for active causal 2D gathering and sparse retrieval paths
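
To give a feel for what packing a 1D token sequence onto a 256×256 Hilbert grid could look like, here is a minimal sketch of the standard Hilbert-curve index-to-coordinate mapping. This is an assumption for illustration only: the function name hilbert_d2xy is hypothetical, and the actual packing in modeling_convgpt_v2.py (orientation, padding, ordering) may differ.

```python
def hilbert_d2xy(n, d):
    """Map 1D index d to (x, y) on an n x n Hilbert curve (n a power of two).

    Textbook algorithm, shown as an illustration; not taken from the
    ConvGPT-v2 source files.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Consecutive token positions land on neighbouring grid cells.
for position in range(4):
    print(position, hilbert_d2xy(256, position))
```

The property that makes such a layout attractive for a 2D convolution is locality: tokens that are close in the sequence stay close on the grid.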

Important distinction: the model is free of standard dense Transformer self-attention, but it is not strictly “attention-free” because the chunk_token_memory router performs sparse attention-like retrieval over selected prior chunks/tokens.
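
The sketch below illustrates, in generic terms, why this kind of retrieval is attention-like: a query scores chunks of prior tokens, keeps only the top-scoring chunks, and then takes a softmax-weighted mix of the tokens inside them. It is a plain-Python illustration under assumed details (mean-key chunk scoring, fixed chunk size, top-k selection); the function name and parameters are hypothetical and this is not the chunk_token_memory implementation.

```python
import math


def sparse_chunk_retrieval(query, keys, values, chunk_size=4, top_k=1):
    """Generic top-k chunk retrieval over prior tokens (illustrative only)."""
    dim = len(query)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Split prior positions into contiguous chunks.
    chunks = [range(i, min(i + chunk_size, len(keys)))
              for i in range(0, len(keys), chunk_size)]

    # Score each chunk by the query's dot product with the chunk's mean key.
    def chunk_score(chunk):
        mean_key = [sum(keys[i][d] for i in chunk) / len(chunk) for d in range(dim)]
        return dot(query, mean_key)

    best = sorted(chunks, key=chunk_score, reverse=True)[:top_k]
    selected = sorted(i for chunk in best for i in chunk)

    # Softmax attention restricted to tokens in the selected chunks.
    logits = [dot(query, keys[i]) / math.sqrt(dim) for i in selected]
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(len(values[0]))]
    return selected, out


keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[float(i)] for i in range(8)]
selected, out = sparse_chunk_retrieval([0.0, 1.0], keys, values)
print(selected, out)
```

Only the tokens in the selected chunks are touched, which is what separates this from dense self-attention over the whole prefix.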

Repository Contents

This repository includes:

  • checkpoint-61000/ — latest model checkpoint
  • earlier checkpoints from approximately checkpoint-47000 to checkpoint-61000
  • ConvGPT-v2 source files:
    • modeling_convgpt_v2.py
    • configuration_convgpt_v2.py
    • registration.py
    • __init__.py
  • training scripts:
    • train_convgpt_v2_2d_pleias_long.py
    • run_convgpt_v2_b200_full_synth_20h.sh
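
Since the repository holds several checkpoint directories, a small helper can pick the highest-numbered one from a file listing. A minimal sketch: the helper name latest_checkpoint and the example file names inside the checkpoint directories are hypothetical. In practice the listing could come from huggingface_hub.list_repo_files(repo_id).

```python
import re


def latest_checkpoint(paths):
    """Return the highest-numbered 'checkpoint-N' directory named in paths, or None."""
    steps = set()
    for path in paths:
        match = re.match(r"checkpoint-(\d+)(?:/|$)", path)
        if match:
            steps.add(int(match.group(1)))
    return f"checkpoint-{max(steps)}" if steps else None


# Hypothetical listing; the files inside each checkpoint directory are assumed.
example_paths = [
    "checkpoint-47000/config.json",
    "checkpoint-61000/config.json",
    "modeling_convgpt_v2.py",
]
print(latest_checkpoint(example_paths))
```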

Intended Use

This model is intended for:

  • research into convolutional / dense-attention-free language model architectures
  • experiments with sparse retrieval memory as an alternative to Transformer self-attention
  • studying scaling behavior of ConvGPT-v2 on synthetic reasoning/chat data
  • checkpoint analysis, BPB/perplexity evaluation, and generation experiments

It is not intended for production use, medical advice, legal advice, or safety-critical applications.

Usage

Because this is a custom architecture, load with trusted remote code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

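repo = "mkurman/convgpt-v2-b200-full-synth-20h"
checkpoint = "checkpoint-134000"  # checkpoint subfolder inside the repo

# Load the tokenizer from the repo's tokenizer directory (assumed to be named
# "tokenizer"); the copy stored with the checkpoint files is incorrect.
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", trust_remote_code=True)

# A Hub repo id cannot contain a checkpoint path, so pass it as a subfolder.
model = AutoModelForCausalLM.from_pretrained(
    repo,
    subfolder=checkpoint,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()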

messages = [{"role": "user", "content": "Explain hypertension briefly."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))

Training Notes

The model was trained using a custom ConvGPT-v2 trainer on synthetic reasoning data. The uploaded shell and Python scripts document the exact training setup, including:

  • B200-oriented BF16 training
  • full PleIAs/SYNTH streaming mode
  • causal full-token loss
  • checkpoint upload support to Hugging Face Hub
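
As a reference point for the loss figures reported in nats/token, the sketch below computes a mean next-token cross entropy in plain Python. It assumes "causal full-token loss" means that every position contributes to the loss, with the prediction at position t scored against token t+1; the function name is hypothetical and the code is not taken from the training script.

```python
import math


def causal_full_token_loss(logits, token_ids):
    """Mean next-token cross entropy in nats; logits[t] predicts token_ids[t + 1]."""
    total = 0.0
    count = 0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        peak = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[token_ids[t + 1]]
        count += 1
    return total / count


# Uniform logits over a 4-token vocabulary give a loss of ln(4) nats/token.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = causal_full_token_loss(uniform_logits, [0, 1, 2, 3])
print(loss)
```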

The logged metrics at checkpoint-61000 were approximately:

checkpoint: checkpoint-61000
cross entropy: ~1.753 nats/token
tokens seen: ~31,981,568,000
per-token perplexity: ~5.77
bits/token: ~2.53
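
The token count and checkpoint number above divide evenly, which suggests a fixed token budget per step. A minimal sketch of that check, assuming the checkpoint number equals the optimizer step count and that each step consumed the same number of tokens (neither is stated explicitly in this card):

```python
tokens_seen = 31_981_568_000  # reported tokens seen
step = 61_000                 # checkpoint number, assumed to be the step count

tokens_per_step, remainder = divmod(tokens_seen, step)
print(tokens_per_step, remainder)
print(tokens_per_step == 2 ** 19)
```

Under those assumptions each step covers 524,288 tokens, for example 512 sequences of 1024 tokens or 256 sequences of 2048 tokens.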

Limitations

This is an experimental research checkpoint. Known limitations:

  • generation quality is not yet comparable to mature Transformer LMs
  • may hallucinate or produce malformed reasoning
  • trained mostly on synthetic PleIAs/SYNTH data, so distributional coverage is largely limited to that dataset
  • not instruction-safety tuned
  • custom architecture requires trust_remote_code=True
  • sparse retrieval memory is attention-like, so the model should not be described as strictly attention-free

Citation / Attribution

If you use this model, please refer to it as:

ConvGPT-v2 B200 Full SYNTH 20h — an experimental dense-self-attention-free convolutional sparse-retrieval language model by Mariusz Kurman.
