---
library_name: transformers
datasets:
- Svngoku/xP3x-Kongo
language:
- kg
metrics:
- bleu
pipeline_tag: text-generation
tags:
- africa
- languages
---
# Kongo Llama Experiment

## Model Details

### Tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Assuming your custom tokenizer is `tokenizer`
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="[BOS]",  # Replace with your special tokens
    eos_token="[EOS]",
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# Ensure padding is applied to the right side (as expected for causal language modeling)
wrapped_tokenizer.padding_side = "right"
```
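The snippet above assumes a trained `tokenizer` object already exists. As a minimal sketch (not part of the original card), such a tokenizer could be trained with the 🤗 `tokenizers` library; the corpus file name and vocabulary size below are assumptions, not values from the original experiment.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Hypothetical sketch: train a byte-level BPE tokenizer on a plain-text Kikongo corpus.
# "kongo_corpus.txt" and vocab_size=16000 are assumptions, not details from the original run.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

bpe_trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["[BOS]", "[EOS]", "[UNK]", "[PAD]"],
)
tokenizer.train(files=["kongo_corpus.txt"], trainer=bpe_trainer)
```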
### Model
```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=len(wrapped_tokenizer),  # Vocabulary size from the wrapped tokenizer
    hidden_size=512,                    # Adjust model size as needed
    intermediate_size=1024,
    num_hidden_layers=8,                # Number of layers and attention heads
    num_attention_heads=8,
    max_position_embeddings=512,
    rms_norm_eps=1e-6,
    initializer_range=0.02,
    use_cache=True,
    pad_token_id=wrapped_tokenizer.pad_token_id,
    bos_token_id=wrapped_tokenizer.bos_token_id,
    eos_token_id=wrapped_tokenizer.eos_token_id,
)

model = LlamaForCausalLM(config)
```
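This configuration yields a small Llama-style model. A quick sanity check of the parameter count (not in the original card) can be printed before training:

```python
# Report the number of trainable parameters for the configuration above
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params / 1e6:.1f}M trainable parameters")
```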
### Trainer
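The `Trainer` below expects an already tokenized `dataset`. One possible way to build it from the Svngoku/xP3x-Kongo dataset listed in the metadata is sketched here; the column name `"targets"` and the truncation length are assumptions about the dataset schema, not details from the original experiment.

```python
from datasets import load_dataset

# Hypothetical preprocessing sketch; adjust the column name to the actual dataset fields.
raw_dataset = load_dataset("Svngoku/xP3x-Kongo", split="train")

def tokenize_function(examples):
    return wrapped_tokenizer(
        examples["targets"],   # Assumed text column
        truncation=True,
        max_length=512,        # Matches max_position_embeddings above
    )

dataset = raw_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_dataset.column_names,
)
```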
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Data collator for causal language modeling (mlm=False derives the labels from input_ids)
data_collator = DataCollatorForLanguageModeling(tokenizer=wrapped_tokenizer, mlm=False)

# Define training arguments
training_args = TrainingArguments(
    output_dir="kongo-llama",        # Output directory for model and checkpoints
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=1000,
)

trainer = Trainer(
    model=model,                     # Model instance defined above
    args=training_args,              # Training arguments
    train_dataset=dataset,           # Tokenized dataset with input_ids
    tokenizer=wrapped_tokenizer,     # Wrapped tokenizer
    data_collator=data_collator,     # Data collator for causal language modeling
)

trainer.train()
```
## Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
## Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
## Uses
```python
from transformers import LlamaForCausalLM

# Load the trained model from a checkpoint
model = LlamaForCausalLM.from_pretrained('/content/kongo-llama/checkpoint-9000')

# Prepare input text
text = "Nzambi "
inputs = wrapped_tokenizer(text, return_tensors="pt")

# Generate text (the tokenized prompt must be passed to generate)
generated_ids = model.generate(
    **inputs,
    max_length=150,   # Increased length
    num_beams=5,      # Use beam search
    temperature=0.7,  # Adjust temperature for creativity
    do_sample=True,
    top_k=50,         # Limit vocabulary for next token
    top_p=0.95,       # Nucleus sampling
)

# Decode and print the generated text
generated_text = wrapped_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```