predBor-v0.5
A preview of predBor, a small language model built from the ground up to natively understand Bosnian/Croatian/Serbian while supporting English.
NOTE THAT THIS IS A BASE LANGUAGE MODEL. IT DOES NOT POSSESS THE ABILITY TO CHAT OR ANSWER QUESTIONS, ONLY CONTINUE TEXT.
Architecture Details
- Type: Causal Language Model
- Parameters: 779M
- Architecture: LLaMA
- Context window: 4096 tokens
- Tokenizer: Bor-v2
- Dataset: Bor-CORPUS-22B
This repo contains an undertrained checkpoint of the predBor language model that has seen only 11B tokens of data, yet it already shows promising performance on all four supported languages.
Emerging Capabilities
Even undertrained, predBor is starting to show emerging skills in several areas, which reflects the quality of its data and architecture. Notable examples:
- Translation: The model has developed a semantic mapping between BCS and English and can translate words (and sometimes phrases) when given the right prompt.
- Fact memorization: The model has seen enough data to show a surprising amount of general world knowledge, such as plausible medical advice, historical facts, recipes with logically ordered steps and ingredients, and the capital city of every country... for some reason.
- Question answering: When prompted with a question and the beginning of an answer (e.g., "The meaning of life is"), it produces a somewhat plausible answer, depending on the topic. Because this is an early checkpoint, some answers will be hallucinated. This does not represent the final state of the project.
- And more. Feel free to download and test the model yourself.
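Because this is a base model, "the right prompt" in practice means a text pattern that the model can continue. The following is a minimal sketch of such patterns for translation and question answering. The helper names `build_translation_prompt` and `build_qa_prompt`, the `Bosanski:`/`English:` labels, and the example word pairs are illustrative assumptions, not prompts published by the predBor authors.

```python
def build_translation_prompt(pairs, word, src_label="Bosanski", tgt_label="English"):
    """Build a few-shot continuation prompt that ends where the translation should start."""
    blocks = [f"{src_label}: {src}\n{tgt_label}: {tgt}" for src, tgt in pairs]
    # Leave the last target slot open so the base model continues with a translation.
    blocks.append(f"{src_label}: {word}\n{tgt_label}:")
    return "\n\n".join(blocks)


def build_qa_prompt(question, answer_start):
    """Pair a question with the beginning of its answer, which a base model can continue."""
    return f"{question}\n{answer_start}"


# Hypothetical example pairs, chosen only for illustration.
examples = [("pas", "dog"), ("kuća", "house")]
translation_prompt = build_translation_prompt(examples, "voda")
qa_prompt = build_qa_prompt("Koji je glavni grad Hrvatske?", "Glavni grad Hrvatske je")

print(translation_prompt)
print(qa_prompt)
```

Either string can be passed to the tokenizer as the prompt. Keeping `max_new_tokens` small for such prompts helps stop the model from continuing the pattern with invented examples of its own.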
English Benchmark Comparison (0-shot lm-eval)
predBor-v0.5 (from-scratch, BCS-primary, early v0.5 checkpoint, 11B tokens)
vs
gpt2-orao (GPT-2 Large, finished Serbian model)
| Task | predBor-v0.5 | gpt2-orao |
|---|---|---|
| HellaSwag acc_norm | 34.5% | 26.8% |
| ARC-Challenge acc_norm | 23.8% | 25.9% |
Note: predBor was evaluated on an RTX 2050; the full results JSON files are attached.
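To put the table in context, it can help to compare each score with random guessing. The sketch below assumes that both tasks behave as four-way multiple choice, so chance is roughly 25%. That holds for HellaSwag and for most, though not all, ARC-Challenge questions. The scores are copied from the table above; the `margins` helper is a hypothetical name.

```python
# Scores copied from the table above (acc_norm, in percent).
scores = {
    "HellaSwag": {"predBor-v0.5": 34.5, "gpt2-orao": 26.8},
    "ARC-Challenge": {"predBor-v0.5": 23.8, "gpt2-orao": 25.9},
}

# Assumption: both tasks are treated as 4-way multiple choice, so chance is about 25%.
CHANCE = 25.0


def margins(task_scores, chance=CHANCE):
    """Return each model's margin over the chance baseline, in percentage points."""
    return {model: round(score - chance, 1) for model, score in task_scores.items()}


report = {task: margins(task_scores) for task, task_scores in scores.items()}
for task, task_margins in report.items():
    print(task, task_margins)
```

Under that assumption, the HellaSwag result for predBor is the only score clearly above chance, while both models' ARC-Challenge scores sit within about a point of random guessing.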
How to run
Hugging Face Transformers (for quick testing)
pip install transformers torch accelerate
Then run:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "absltnull/predBor-v0.5"
# Load the tokenizer and the weights in bf16; device_map="auto" places them on the available device
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# Base model: give it text to continue, not an instruction
prompt = "Glavni grad Bosne i Hercegovine je"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample a continuation of up to 256 new tokens
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
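With a 4096-token context window, the prompt and the generated tokens have to fit in the same budget. The following is a small sketch of how the `max_new_tokens` value could be clamped for long prompts. The helper name `clamp_new_tokens` is hypothetical, and the logic assumes that the prompt length is measured in tokens from the model's own tokenizer.

```python
CONTEXT_WINDOW = 4096  # context window listed in the architecture details


def clamp_new_tokens(prompt_tokens, requested=256, context_window=CONTEXT_WINDOW):
    """Return how many tokens can be generated without exceeding the context window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt already fills the context window; shorten it")
    # Generate the requested amount, or whatever room is left if that is smaller.
    return min(requested, context_window - prompt_tokens)


short_budget = clamp_new_tokens(12)    # short prompt: the full request fits
long_budget = clamp_new_tokens(4000)   # long prompt: only the remaining room is returned
print(short_budget, long_budget)
```

In the snippet above, `inputs["input_ids"].shape[1]` gives the prompt length in tokens, and the returned value can be passed as `max_new_tokens`.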
System requirements
- 3.2 GB of storage (full model)
- At least 2 GB of VRAM/RAM to load the full model in bf16/fp16
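These figures can be sanity-checked from the parameter count. The sketch below is a back-of-the-envelope estimate of the weights alone. That the 3.2 GB on disk corresponds to fp32 weights is an inference from the arithmetic, not something the model card states, and `weight_size_gb` is a hypothetical helper name.

```python
PARAMS = 779_000_000  # parameter count from the architecture details

# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_size_gb(params, dtype):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


sizes = {dtype: round(weight_size_gb(PARAMS, dtype), 2) for dtype in BYTES_PER_PARAM}
print(sizes)
```

The estimate covers weights only. The KV cache and activations add to it during generation, which is consistent with the 2 GB figure being a lower bound.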
The model is deliberately released in this raw base state so the community can see the real quality of the training data before any continued pretraining, post-training or alignment is applied.
Feel free to fine-tune. The finished version of predBor will be out soon.
