gemma4-e4b-colloquial-ru-merged
Full-weight Gemma 4 E4B checkpoint with the colloquial-Russian LoRA merged in, packaged for vLLM / RunPod Serverless. No PEFT is needed at inference time.
What this model does
Rewrites formal Russian into casual chat-style Russian (Telegram-like), without profanity, while keeping facts, names, numbers, and paragraph structure.
This is not a general chat model; use the instruction prefix from training (see below).
Model lineage
| Stage | Artifact |
|---|---|
| Base | google/gemma-4-E4B-it |
| LoRA (SFT) | pavelfedortsov/gemma4-e4b-lora-colloquial-ru |
| This repo | LoRA merged into base + vLLM fixes (k_norm, processor configs) |
The merge was done with PEFT's merge_and_unload(). The language_model k_norm weights for layers 24–41 were missing and were copied from the base checkpoint, which vLLM requires.
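For intuition, merging folds each adapter's low-rank update into the corresponding base weight, W' = W + (alpha / r) · B A, so no adapter is needed afterwards. The toy sketch below is not the actual merge script: the matrix values and shapes are invented, the helper names are hypothetical, and the alpha / r scale assumes PEFT's default (non-rsLoRA) scaling. Only the rank/alpha pair (32 / 64) comes from the training configuration in this card.

```python
# Toy illustration of a LoRA merge on one weight matrix (pure Python, no PEFT).
# Matrix values and shapes are invented; only rank/alpha come from this card.

LORA_RANK = 32
LORA_ALPHA = 64


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


scale = LORA_ALPHA / LORA_RANK  # 2.0, assuming the default alpha / r scaling

base_weight = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a frozen base matrix W
lora_a = [[0.5, 0.0]]                   # stand-in for adapter matrix A (1 x 2)
lora_b = [[1.0], [2.0]]                 # stand-in for adapter matrix B (2 x 1)

merged = merge_lora(base_weight, lora_a, lora_b, scale)
print(merged)  # [[2.0, 0.0], [2.0, 1.0]]
```

After this step the merged matrix is an ordinary dense weight, which is why the checkpoint loads without any adapter machinery.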
Training data
- 50,000 SFT pairs in a profanity-free ("mat"-free) colloquial style
- Hub dataset: pavelfedortsov/russian-colloquial-sft-50k
- Built from kurumikz/telegram-corpus-russian-kazakh + Gemini pair generation (see dataset card)
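User prompt template (training and inference):

```text
Перепиши простым разговорным русским, как в переписке. Без мата и грубости. Сохрани смысл:
<формальный текст>
```

In English, the prefix reads: "Rewrite in plain conversational Russian, as in a chat. No profanity or rudeness. Keep the meaning:", followed on the next line by the formal text. Send the prefix in Russian exactly as shown.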
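A small helper keeps the training prefix identical across calls. The sketch below is illustrative: the names `INSTRUCTION_PREFIX` and `build_user_message` are hypothetical, and joining the prefix and the text with a single newline follows the inference examples in this card.

```python
# Hypothetical helper: wraps formal text in the instruction prefix used in training.
INSTRUCTION_PREFIX = (
    "Перепиши простым разговорным русским, как в переписке. "
    "Без мата и грубости. Сохрани смысл:"
)


def build_user_message(formal_text: str) -> dict:
    """Return a single chat message: the prefix, a newline, then the text."""
    text = formal_text.strip()
    if not text:
        raise ValueError("formal_text must not be empty")
    return {"role": "user", "content": f"{INSTRUCTION_PREFIX}\n{text}"}


messages = [
    build_user_message("Сегодня на совещании обсуждали внедрение новой версии API.")
]
print(messages[0]["content"])
```

The resulting `messages` list can be passed to a chat template or to an OpenAI-compatible chat endpoint.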
Training configuration (LoRA → merge)
Config file (also in card_assets/train_colloquial_e4b_gpu.yaml):
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Method | LoRA on language tower (model.language_model.*) |
| LoRA rank / alpha | 32 / 64 |
| Target modules | q,k,v,o + MLP (gate, up, down) |
| Dataset | 50k × 1 repeat |
| Epochs | 2 (12,500 optimizer steps) |
| Seq length | 512 |
| Batch | 1 × grad accum 8 (effective 8) |
| LR | 1e-4, cosine, warmup 3% |
| Precision | bf16, gradient checkpointing |
| Loss | assistant-only |
| Hardware | RunPod A100 80GB |
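The step count in the table follows from the dataset size, the number of epochs, and the effective batch size. A quick check, assuming a single device and one optimizer step per effective batch; the warmup step count is derived here from the 3% ratio and is not stated in the card:

```python
# Sanity check of the schedule figures from the table above.
num_examples = 50_000
epochs = 2
per_device_batch = 1
grad_accum = 8
warmup_ratio = 0.03

effective_batch = per_device_batch * grad_accum    # 8
steps_per_epoch = num_examples // effective_batch  # 6250
total_steps = steps_per_epoch * epochs             # 12500
warmup_steps = round(total_steps * warmup_ratio)   # 375 (derived, an assumption)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```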
Training metrics (LoRA run)
| Metric | Start (step ~25) | End (step 12,500) | Best |
|---|---|---|---|
| loss | ~3.42 | ~0.81 | ~0.67 |
| mean_token_accuracy | ~0.63 | ~0.82 | ~0.84 |
Checkpoints were saved every 1,000 steps in the LoRA adapter repo.
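If the reported loss is the mean per-token cross-entropy in nats (an assumption; the card does not say so), it converts to perplexity as exp(loss). A minimal sketch using the loss values from the metrics table:

```python
import math

# Loss values copied from the metrics table; reading them as nat-based
# cross-entropy is an assumption, not something the card states.
reported_loss = {"start": 3.42, "end": 0.81, "best": 0.67}

perplexity = {name: math.exp(value) for name, value in reported_loss.items()}

for name, value in perplexity.items():
    print(f"{name}: loss={reported_loss[name]:.2f} -> perplexity~{value:.2f}")
```

Under that assumption, perplexity on the assistant tokens drops from roughly 31 at the start of training to a little over 2 at the end.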
Inference
RunPod Serverless (vLLM)
```text
MODEL_NAME=pavelfedortsov/gemma4-e4b-colloquial-ru-merged
HF_TOKEN=<your_token>
TRUST_REMOTE_CODE=true
DTYPE=bfloat16
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.90
ENFORCE_EAGER=true
ENABLE_LORA=false
LANGUAGE_MODEL_ONLY=true
LIMIT_MM_PER_PROMPT={"image":0,"audio":0,"video":0}
```
Recommended GPU: ≥40 GB VRAM (merged ~32 GB weights in bf16).
Transformers (local)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pavelfedortsov/gemma4-e4b-colloquial-ru-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Formal input text, wrapped in the instruction prefix used during training.
formal = "Сегодня на совещании обсуждали внедрение новой версии API."
user = (
    "Перепиши простым разговорным русским, как в переписке. "
    "Без мата и грубости. Сохрани смысл:\n"
    f"{formal}"
)
messages = [{"role": "user", "content": user}]

# Render the chat template to a string, then tokenize it.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
OpenAI-compatible API (RunPod / vLLM)
```bash
curl "$RUNPOD_URL/v1/chat/completions" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pavelfedortsov/gemma4-e4b-colloquial-ru-merged",
    "messages": [{
      "role": "user",
      "content": "Перепиши простым разговорным русским, как в переписке. Без мата и грубости. Сохрани смысл:\nВаш формальный текст."
    }],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```
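The same request can be assembled from Python with only the standard library. This is a sketch, not an official client: `build_chat_request` is a hypothetical helper, and `RUNPOD_URL` / `RUNPOD_API_KEY` are read from the environment as in the curl example, with placeholder fallbacks (illustrative values, not real endpoints) so the snippet runs without them. It builds the request but does not send it.

```python
import json
import os
import urllib.request

MODEL_ID = "pavelfedortsov/gemma4-e4b-colloquial-ru-merged"


def build_chat_request(base_url: str, api_key: str, user_content: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": 512,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder fallbacks are illustrative only; set the real values in your shell.
request = build_chat_request(
    os.environ.get("RUNPOD_URL") or "https://example.invalid",
    os.environ.get("RUNPOD_API_KEY") or "placeholder-key",
    "Перепиши простым разговорным русским, как в переписке. "
    "Без мата и грубости. Сохрани смысл:\nВаш формальный текст.",
)
print(request.get_method(), request.full_url)
```

To send it, pass `request` to `urllib.request.urlopen` and parse the JSON body; in the OpenAI chat-completions response format the reply text is expected under `choices[0].message.content`.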
Limitations
- The Gemma license applies to the base architecture and weights.
- Quality varies on long news-style text; the model may shorten or paraphrase aggressively.
- The model is not safety-tuned; run your own evaluation before production use.
- Output from the merged model and from base-plus-LoRA inference can differ slightly in style.
Related repos
| Resource | Link |
|---|---|
| LoRA adapter | https://huggingface.co/pavelfedortsov/gemma4-e4b-lora-colloquial-ru |
| Dataset (50k) | https://huggingface.co/datasets/pavelfedortsov/russian-colloquial-sft-50k |
| Base model | https://huggingface.co/google/gemma-4-E4B-it |
