Russian Jokes Transformer Model

A model for generating Russian jokes based on a modified Transformer architecture.

Model Features

  1. Specialization: trained on a dataset of Russian jokes (135k examples)
  2. Tokenization: Byte-Level BPE with a vocabulary size of 1024
  3. Architecture Features (illustrated in the sketch after this list):
    • ALiBi (Attention with Linear Biases) for positional encoding
    • GQA (Grouped-Query Attention)
    • SwiGLU in FFN layers
    • RMSNorm instead of LayerNorm
  4. Configurations:
    • Nano (3 layers, 4 heads, hidden size 96)
    • Mini (6 layers, 6 heads, hidden size 384)
    • Small (12 layers, 12 heads, hidden size 768)
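
A minimal, self-contained sketch of these pieces is given below. It is not the repository's implementation, only a generic PyTorch rendition of how ALiBi attention biases, RMSNorm, and a SwiGLU feed-forward block are commonly written, with the three configurations above collected in a plain dictionary; GQA (sharing key/value heads across groups of query heads) is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

# The three configurations listed above.
CONFIGS = {
    "nano":  dict(n_layers=3,  n_heads=4,  hidden=96),
    "mini":  dict(n_layers=6,  n_heads=6,  hidden=384),
    "small": dict(n_layers=12, n_heads=12, hidden=768),
}

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # ALiBi replaces learned positional embeddings with a static,
    # head-specific linear penalty added to the attention logits.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Shape (n_heads, seq_len, seq_len); a causal mask hides the upper triangle.
    return slopes[:, None, None] * distance[None, :, :]

class RMSNorm(nn.Module):
    # LayerNorm replacement: rescale by the root mean square, no mean centering.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    # Gated FFN: silu(x W1) * (x W3), projected back down with W2.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))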

Technical Specifications

  • Context Window: 128 tokens
  • Special Tokens: [EOS] for sequence end
  • Average Example Length: ~70 tokens
  • Regularization: Dropout 0.1
  • Optimizer: AdamW with weight decay 0.01
  • Training: 10k steps with linear warmup (optimizer setup sketched below)
  • Model Size: 79.7M parameters (FP32, Safetensors)
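
As a rough illustration of the training setup, the snippet below wires AdamW with weight decay 0.01 to a linear-warmup schedule in plain PyTorch. Only the 10k total steps and the weight decay value come from this card; the base learning rate, warmup length, and the toy model and loss are placeholders.

import torch
import torch.nn as nn

TOTAL_STEPS = 10_000   # from the card
WARMUP_STEPS = 1_000   # assumed warmup length
BASE_LR = 3e-4         # assumed base learning rate

model = nn.Linear(16, 16)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.01)

# Linear warmup from ~0 to BASE_LR over WARMUP_STEPS, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS),
)

for step in range(TOTAL_STEPS):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # dummy loss, just to drive the loop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()               # advance the warmup schedule every step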

Usage


import torch

# ByteLevelBPETokenizer and TransformerForCausalLM are assumed to be the
# project's own classes, imported from the accompanying training code.
REPO_NAME = 'bikmish/llm-course-hw1'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = ByteLevelBPETokenizer.from_pretrained(REPO_NAME)
check_model = TransformerForCausalLM.from_pretrained(REPO_NAME)
check_model = check_model.to(device)
check_model = check_model.eval()

# Encode a prompt and generate a continuation with top-k sampling.
text = "Штирлиц пришел домой"
input_ids = torch.tensor(tokenizer.encode(text), device=device)
model_output = check_model.generate(
    input_ids[None, :],  # add a batch dimension
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=10,
)
print(tokenizer.decode(model_output[0].tolist()))

Example Output (absolutely hilarious)

Штирлиц пришел домой с работы, приехал.
Преподаватель к себе и вижу: - Давай зайдем сегодня на работу!
- А как ты думаешь, что мы тебя не пьем?
- Дык нет.
- А ты что, тогда находишься?
- А ты не знаешь - кто?
- Дверь откроется!