somgpt-base is a Somali causal language model, continually pre-trained from google/gemma-3-270m on maanka2/somali-web-corpus.
This model was further pre-trained on Somali web text to improve its understanding of Somali vocabulary, grammar, spelling, and writing patterns.
somgpt is a base language model designed for text continuation and language modeling. It is not instruction-tuned and is not optimized for chat, question answering, or assistant-style interactions.
For conversational AI or task-specific applications, additional supervised fine-tuning (SFT) or instruction tuning is recommended.
The training data is maanka2/somali-web-corpus, a collection of cleaned Somali-language web text gathered from various online sources.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "maanka2/somgpt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Somali prompt for the model to continue
prompt = "Soomaaliya waa dal ku yaal"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 256 new tokens; the low temperature keeps output close to greedy decoding
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
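Since this is a base model meant for language modeling rather than chat, one common way to sanity-check it is to measure perplexity on a piece of Somali text. The following is a minimal sketch, not part of the original card: the helper names `perplexity` and `score_text` are hypothetical, and it assumes the same `model_id` as the generation snippet. Lower perplexity means the model finds the text more predictable.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def score_text(text, model_id="maanka2/somgpt"):
    """Return the model's perplexity on `text` (hypothetical helper)."""
    # Heavy imports stay local so the pure-math helper above works without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    # Position i predicts token i + 1, so drop the last logit and the first label.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    targets = input_ids[0, 1:]
    token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return perplexity(token_log_probs.tolist())


if __name__ == "__main__":
    # Downloads the model on first run.
    print(score_text("Soomaaliya waa dal ku yaal"))
```

Comparing the score of well-formed Somali text against, say, shuffled or misspelled text gives a rough indication of how much the model has absorbed of Somali spelling and grammar; absolute values are only comparable between models that share a tokenizer.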