Instructions for using veyra-ai/Veyra2-30M-Base with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use veyra-ai/Veyra2-30M-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="veyra-ai/Veyra2-30M-Base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("veyra-ai/Veyra2-30M-Base")
model = AutoModelForCausalLM.from_pretrained("veyra-ai/Veyra2-30M-Base")

vLLM

How to use veyra-ai/Veyra2-30M-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "veyra-ai/Veyra2-30M-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/Veyra2-30M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/veyra-ai/Veyra2-30M-Base

SGLang

How to use veyra-ai/Veyra2-30M-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "veyra-ai/Veyra2-30M-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/Veyra2-30M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "veyra-ai/Veyra2-30M-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "veyra-ai/Veyra2-30M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use veyra-ai/Veyra2-30M-Base with Docker Model Runner:
```
docker model run hf.co/veyra-ai/Veyra2-30M-Base
```

Veyra2 30M Base 2B Tokens

Veyra2 30M Base 2B Tokens is a compact English causal language model trained from scratch for fast local inference, experimentation, and downstream fine-tuning.

This release is the first production checkpoint in the Veyra2 30M line. It replaces the older custom Veyra 30M architecture with a Llama-compatible architecture, making it easier to convert to GGUF, run locally, and build downstream instruct or tool-use variants.

The model is a base language model, not an instruction-tuned assistant. It is best used for text completion, continued pretraining, evaluation, and as a starting point for fine-tuning.

Figure: Training loss over 2B tokens.

Model details

Field	Value
Parameters	34,611,712
Model family name	Veyra2 30M
Architecture	Llama-compatible causal LM
Layers	8
Hidden size	512
Intermediate size	2048
Attention heads	8
KV heads	2
Context length	1024 tokens
Vocabulary size	8192
Training tokens	2,000,158,720
Precision during training	bfloat16 model weights with fp32 optimizer state
License	Apache 2.0

Although the model is referred to as Veyra2 30M for continuity with the Veyra 30M family, its actual parameter count is about 34.6M (34,611,712).
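
The parameter count in the table can be reproduced from the listed dimensions. The sketch below assumes a standard Llama-style layout that this card does not spell out: no bias terms, a head dimension of hidden size divided by attention heads (64), a gated MLP with three projection matrices, two normalization weight vectors per layer plus a final one, and tied input/output embeddings. Under those assumptions the total matches the reported figure.

```python
# Dimensions taken from the model details table.
vocab_size = 8192
hidden_size = 512
intermediate_size = 2048
num_layers = 8
num_heads = 8
num_kv_heads = 2

# Assumption: Llama-style head dimension (hidden_size / num_heads).
head_dim = hidden_size // num_heads  # 64

# Attention projections, assumed to have no bias terms.
q_proj = hidden_size * num_heads * head_dim
k_proj = hidden_size * num_kv_heads * head_dim
v_proj = hidden_size * num_kv_heads * head_dim
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + k_proj + v_proj + o_proj

# Assumption: gated MLP with gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size

# Assumption: two normalization weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms

# Assumption: tied input/output embeddings, so the matrix is counted once.
embeddings = vocab_size * hidden_size
final_norm = hidden_size

total_params = num_layers * per_layer + embeddings + final_norm
print(f"{total_params:,}")  # 34,611,712
```

Roughly 4.2M of the parameters sit in the embedding matrix, and the remaining ~30.4M in the eight transformer layers.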

Training data

Veyra2 30M Base 2B was trained on an English-heavy mixture designed for small-model local utility:

80% Cosmopedia v2 style educational and synthetic textbook data
10% Python/code-oriented data
10% chat/instruction-style data

The chat portion uses ChatML-style formatting, so the base model may sometimes continue ChatML conversations or emit ChatML special tokens. This is expected behavior for the base checkpoint and is useful for later instruction tuning, but this model should not be treated as a polished chat assistant.
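
The card does not say whether the mixture percentages are measured in tokens or documents, or how the sampling was implemented. Purely as an illustration of weighted mixing, the sketch below draws a source label for each sample in proportion to the listed weights. The source names and the `sample_sources` helper are hypothetical and are not taken from the training code.

```python
import random
from collections import Counter

# Mixture weights from the list above; the keys are illustrative labels only.
MIXTURE = {
    "educational": 0.80,
    "code": 0.10,
    "chat": 0.10,
}

def sample_sources(num_samples: int, seed: int = 0) -> Counter:
    """Draw source labels in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return Counter(rng.choices(names, weights=weights, k=num_samples))

counts = sample_sources(100_000)
for name in MIXTURE:
    # Observed fraction per source; it approaches the target weight
    # as the number of samples grows.
    print(name, round(counts[name] / 100_000, 3))
```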

Training setup

The model was trained on approximately 2B tokens at a 1024-token sequence length. The optimizer recipe used CosineGatedAdam for matrix parameters and AdamW for auxiliary parameters such as embeddings and normalization weights.

Training logs near the end of the run showed approximately:

Metric	Value
Final training loss band	~2.09-2.12
Final training perplexity band	~8.1-8.3
Average training speed	~235k tokens/sec
Peak VRAM during training	~43 GB

These numbers are measured on the training stream and are not a substitute for downstream task evaluation.

Evaluation

Quick streamed eval

A quick streamed sanity eval was run on Cosmopedia-style data for 262,144 tokens.

Metric	Value
Eval tokens	262,144
Eval loss	2.2842
Eval perplexity	9.82
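
Perplexity is assumed here to be the exponential of the mean per-token cross-entropy loss in nats, which is the usual convention. A small check that the reported losses and perplexities agree under that assumption:

```python
import math

def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)

# Quick streamed eval loss from the table above.
eval_ppl = perplexity(2.2842)
print(round(eval_ppl, 2))  # 9.82

# Final training loss band from the training logs.
train_band = (perplexity(2.09), perplexity(2.12))
print(tuple(round(p, 1) for p in train_band))  # (8.1, 8.3)
```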

BLiMP

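BLiMP was evaluated using the official nyu-mll/blimp dataset with mean token log-likelihood scoring: a minimal pair counts as correct when the model assigns the grammatical sentence a higher mean per-token log-likelihood than the ungrammatical one.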

Metric	Value
Total examples	67,000
Correct	42,809
Overall accuracy	63.89%

BLiMP measures targeted grammatical minimal-pair sensitivity. It should not be interpreted as a general capability benchmark.
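
A minimal sketch of mean token log-likelihood scoring for minimal pairs is shown below. It is not the evaluation script used for this card. `score_tokens` is a hypothetical callable that returns the model's per-token log-probabilities for a sentence, and counting ties as incorrect is an assumption.

```python
from typing import Callable, Iterable, List, Tuple

def mean_token_logprob(token_logprobs: List[float]) -> float:
    """Average log-probability per token, which normalizes for sentence length."""
    return sum(token_logprobs) / len(token_logprobs)

def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score_tokens: Callable[[str], List[float]],
) -> float:
    """Fraction of (good, bad) pairs where the good sentence scores higher.

    `score_tokens` is a hypothetical callable mapping a sentence to the
    model's per-token log-probabilities.
    """
    correct = 0
    total = 0
    for good, bad in pairs:
        good_score = mean_token_logprob(score_tokens(good))
        bad_score = mean_token_logprob(score_tokens(bad))
        # Assumption: a tie counts as incorrect.
        correct += good_score > bad_score
        total += 1
    return correct / total

# The reported accuracy follows directly from the counts in the table.
print(f"{42809 / 67000:.2%}")  # 63.89%
```

Because each pair has exactly one grammatical sentence, a scorer that guesses at random would land near 50%.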

Intended use

Veyra2 30M Base 2B is intended for:

local text completion experiments
lightweight CPU-friendly language modeling
downstream instruction tuning
small-model research
grammar, tokenizer, quantization, and local-inference experiments
building small ChatML, tool-use, Python, or function-calling variants

For direct assistant-style use, wait for an instruction-tuned Veyra2 model or fine-tune this base checkpoint yourself.

What to expect

This is a very small base model. You should expect coherent short completions, recognizable educational prose, some code-like continuations, and occasional ChatML continuation behavior.

You should not expect high factual reliability, robust reasoning, strong instruction following, safety alignment, or long-context consistency. The model may hallucinate, repeat itself, produce incorrect facts, or continue prompts in unexpected formats.

Example usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "veyra-ai/Veyra2-30M-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

prompt = "The purpose of a small language model is"

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(model.device)

pad_token_id = tokenizer.pad_token_id
if pad_token_id is None:
    pad_token_id = tokenizer.eos_token_id

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.7,
        top_p=0.92,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Experimental ChatML continuation

The base model has seen ChatML-style data during pretraining. You can experiment with prompts like:

<|im_start|>user
Explain what a stack is in simple terms.
<|im_end|>
<|im_start|>assistant

This is completion behavior, not instruction tuning. The model may continue the conversation format but should not be treated as a reliable assistant.
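
A prompt in this layout can be assembled with a small helper. The sketch below reproduces the line layout of the example above, including `<|im_end|>` on its own line. Whether that matches the exact formatting seen during training is an assumption, and `build_chatml_prompt` is a hypothetical helper name.

```python
from typing import Dict, List

def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render messages in the ChatML layout shown above and open an assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
        )
    # Leave the assistant turn open so the base model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Explain what a stack is in simple terms."}]
)
print(prompt)
```

The resulting string can be used as the `prompt` in the generation example under Example usage.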

GGUF and quantization

Veyra2 30M 2B GGUF

Limitations

English-focused
Not instruction tuned
Not safety aligned
May hallucinate facts
May produce repetitive or malformed text
Limited context length of 1024 tokens
Small parameter count limits reasoning and world knowledge
ChatML behavior is learned as text continuation, not as a robust assistant policy
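
Because of the 1024-token context limit, the prompt and the generated tokens must fit in the window together. One way to handle long prompts, sketched here as an assumption rather than a recommendation from the model authors, is to keep only the most recent prompt tokens. `fit_to_context` is a hypothetical helper.

```python
from typing import List

CONTEXT_LENGTH = 1024  # from the model details table

def fit_to_context(
    token_ids: List[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens
    # Left-truncate: drop the oldest tokens and keep the tail of the prompt.
    return token_ids[-budget:]

prompt_ids = list(range(2000))  # stand-in for tokenizer output
trimmed = fit_to_context(prompt_ids, max_new_tokens=120)
print(len(trimmed))  # 904
```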

License

This model is released under the Apache 2.0 license unless otherwise noted. Please retain attribution to Veyra AI when redistributing models or releasing derivative work.

Citation / attribution

If you use this model, please refer to it as Veyra2 30M Base 2B Tokens by Veyra AI.
