polka-1.1b

polka-1.1b takes the TinyLlama-1.1B model and enhances it by continuing pretraining on an additional 5.7 billion Polish tokens, primarily sourced from the MADLAD-400 dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using DSIR. Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
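
To see the effect of the extended vocabulary, you can compare how many tokens the original TinyLlama tokenizer and the Polka tokenizer need for the same Polish sentence. The snippet below is a minimal sketch; the TinyLlama checkpoint id and the example sentence are illustrative assumptions, not part of this model card.

from transformers import AutoTokenizer

# Illustrative comparison of tokenizer efficiency on Polish text.
# The TinyLlama repo id below is an assumed reference checkpoint,
# not necessarily the exact base used for polka-1.1b.
text = "Wczoraj wieczorem poszliśmy na długi spacer po starym rynku."

base_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka_tokenizer = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")

print("TinyLlama tokenizer:", len(base_tokenizer.tokenize(text)), "tokens")
print("polka-1.1b tokenizer:", len(polka_tokenizer.tokenize(text)), "tokens")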

The training took 680 GPU hours on a single 8 x RTX 4090 machine with DeepSpeed ZeRO-2.

Context size: 2,048 tokens.
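
For reference, a ZeRO stage 2 run of this kind is usually driven by a small DeepSpeed configuration. The dictionary below is only a hedged sketch of what such a config might look like; the batch sizes and precision settings are assumptions, not the values actually used for this training run.

# Hypothetical DeepSpeed ZeRO-2 config sketch; all concrete values are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumed per-GPU batch size
    "gradient_accumulation_steps": 4,      # assumed accumulation factor
    "bf16": {"enabled": True},             # assumed mixed-precision setting
    "zero_optimization": {
        "stage": 2,                        # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# With the Hugging Face Trainer, such a dict can be passed via
# TrainingArguments(deepspeed=ds_config, ...).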

Notes

This base model was initially developed as the foundation for instruction tuning, which resulted in polka-1.1b-chat. Nonetheless, I'm sharing it with the community because I see potential value in its combination of relatively good performance and an efficient bilingual tokenizer.

The model produces coherent Polish text, but given its small size it is prone to hallucinations.

Sample code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"

# Left padding so that generation continues right after the prompt in batched inputs.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# Load the model in 8-bit and let Accelerate place it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

# Polish for "An example query to the model".
prompt = """Przykładowe zapytanie do modelu"""

model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        penalty_alpha=0.6,
        top_k=5,
    )

# Decode the full sequence (prompt + continuation) back to text.
output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
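
Note that on recent transformers releases the bare load_in_8bit argument is deprecated in favour of an explicit quantization config. A small variant, assuming the bitsandbytes package is installed:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 8-bit load via an explicit quantization config
# (requires the bitsandbytes package).
model = AutoModelForCausalLM.from_pretrained(
    "eryk-mazus/polka-1.1b",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)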