
predBor-v0.5

A preview of predBor, a small language model built from the ground up to natively understand Bosnian/Croatian/Serbian (BCS) while also supporting English.

NOTE THAT THIS IS A BASE LANGUAGE MODEL. IT DOES NOT POSSESS THE ABILITY TO CHAT OR ANSWER QUESTIONS, ONLY CONTINUE TEXT.

Architecture Details

Type: Causal Language Model
Parameters: 779M
Architecture: LLaMA
Context window: 4096 tokens
Tokenizer: Bor-v2
Dataset: Bor-CORPUS-22B

This repo contains an undertrained checkpoint of the predBor language model that has seen only 11B tokens of data, yet it already shows promising performance in all four supported languages.
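Because the context window is 4096 tokens, the prompt and the generated continuation have to share that budget. The sketch below shows one simple way to trim an over-long prompt before generation. The helper name `fit_prompt` and the choice to keep the most recent tokens are assumptions made for illustration, not something prescribed by this model card.

```python
from typing import List

CONTEXT_WINDOW = 4096  # context length listed in the architecture details


def fit_prompt(token_ids: List[int], max_new_tokens: int,
               context_window: int = CONTEXT_WINDOW) -> List[int]:
    """Keep only the most recent prompt tokens that still leave room to generate."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context window")
    # Left-truncate: a base model continues from the end of the text, so the
    # most recent tokens are kept. This simple version does not preserve a
    # leading BOS token if the prompt is cut.
    return token_ids[-budget:]


long_prompt = list(range(5000))        # stand-in for real token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=256)
print(len(trimmed))                    # 3840 tokens, i.e. 4096 - 256
```

Prompts that already fit are returned unchanged, so the helper can be applied unconditionally before calling the model.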

Emerging Capabilities

Even undertrained, predBor is beginning to show emergent skills in several areas, which reflects the quality of its data and architecture. Notable examples:

Translation: The model has developed a semantic mapping between BCS and English and can translate words (and sometimes phrases) when given the right prompt.
Fact memorization: The model has seen enough data to show a surprising amount of general world knowledge, such as viable medical advice, historical facts, recipes with logically correct steps and ingredients, and the capital city of every country there is... for some reason.
Question answering: When prompted with a question and the beginning of an answer (e.g., "The meaning of life is"), it delivers a somewhat viable answer, with quality depending on the topic. Because this checkpoint is undertrained, some answers will be hallucinated. This does not represent the final state of the project.
And more. Feel free to download and test the model yourself.
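Since this is a base model, translation and question answering have to be phrased as text to continue. The sketch below builds a few-shot translation prompt that ends where the model should start writing. The `Bosanski:` / `English:` labels, the example pairs, and the helper name `build_translation_prompt` are all assumptions chosen for illustration; the model card does not specify which prompt format works best.

```python
from typing import List, Tuple


def build_translation_prompt(pairs: List[Tuple[str, str]], word: str) -> str:
    """Format example pairs plus one open line for the model to complete."""
    # Each solved example shows the pattern the model is expected to continue.
    blocks = [f"Bosanski: {src}\nEnglish: {tgt}\n" for src, tgt in pairs]
    # The final block is left open so the continuation is the translation.
    blocks.append(f"Bosanski: {word}\nEnglish:")
    return "\n".join(blocks)


# Hypothetical example pairs; any correct word pairs would do.
examples = [("kuća", "house"), ("voda", "water")]
prompt = build_translation_prompt(examples, "knjiga")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a plain prompt. With a pattern like this it usually makes sense to use a small `max_new_tokens` and keep only the text up to the first newline of the continuation.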

English Benchmark Comparison (0-shot lm-eval)

predBor-v0.5 (from-scratch, BCS-primary, early v0.5 checkpoint, 11B tokens)
vs
gpt2-orao (GPT-2 Large, finished Serbian model)

Task	predBor-v0.5	gpt2-orao
HellaSwag acc_norm	34.5%	26.8%
ARC-Challenge acc_norm	23.8%	25.9%

Note: predBor was evaluated on an RTX 2050; the full results JSON files are attached.
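For readers who want to try a comparable run, here is a minimal sketch using the Python API of lm-evaluation-harness (`pip install lm-eval`). It is not necessarily the exact setup behind the table above: the `bfloat16` dtype and `batch_size` of 1 are assumptions, and `eval_kwargs` and `run_eval` are hypothetical helper names.

```python
from typing import Any, Dict


def eval_kwargs(model_name: str = "absltnull/predBor-v0.5") -> Dict[str, Any]:
    """Arguments for a 0-shot run on the two tasks shown in the table."""
    return {
        "model": "hf",  # Hugging Face Transformers backend
        "model_args": f"pretrained={model_name},dtype=bfloat16",  # dtype is an assumption
        "tasks": ["hellaswag", "arc_challenge"],
        "num_fewshot": 0,
        "batch_size": 1,  # assumption: conservative setting for a small GPU
    }


def run_eval() -> Dict[str, Any]:
    """Run the evaluation and return the per-task metrics."""
    import lm_eval  # imported lazily so the config can be inspected without it

    results = lm_eval.simple_evaluate(**eval_kwargs())
    return results["results"]


print(eval_kwargs()["tasks"])
```

Calling `run_eval()` downloads the model and runs both tasks, which can take a while on a laptop GPU. The exact metric key names in the returned dictionary vary between harness versions, so inspect the output rather than relying on a fixed key.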

How to run

Hugging Face Transformers (for quick testing)

pip install transformers torch accelerate

Then run:

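from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "absltnull/predBor-v0.5"

# Load the tokenizer and the model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # place the model on a GPU if one is available
    trust_remote_code=True
)

# Base model: give it text to continue, not a question or an instruction
prompt = "Glavni grad Bosne i Hercegovine je"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation of up to 256 new tokens
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True
)

# The decoded output contains the prompt followed by the continuation
print(tokenizer.decode(output[0], skip_special_tokens=True))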

System requirements

3.2 GB storage (full model)
2+ GB VRAM/RAM for full bf16/fp16
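These figures can be sanity-checked with simple arithmetic: weight size is roughly the parameter count times the bytes per parameter. The sketch below uses the 779M parameter count from the architecture details. It covers weights only, and `weight_size_gb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 779_000_000  # parameter count from the architecture details
for dtype in ("fp32", "bf16"):
    print(f"{dtype}: {weight_size_gb(params, dtype):.2f} GB")
# fp32: 3.12 GB
# bf16: 1.56 GB
```

The fp32 estimate is close to the 3.2 GB storage figure, and the half-precision estimate sits below the 2+ GB memory figure, which presumably leaves headroom for activations and the key-value cache during generation.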

The model is deliberately released in this raw base state so the community can see the real quality of the training data before any continued pretraining, post-training or alignment is applied.

Feel free to fine-tune. The finished version of predBor will be out soon.
