Model Card for AfriBert

This model card describes AfriBert, an Afrikaans language model fine-tuned on articles from Huisgenoot magazine.

Model Description

AfriBert is an Afrikaans language model built on a BERT architecture. Trained using Huisgenoot magazine articles, the model is designed to handle a variety of natural language processing tasks in Afrikaans. This model aims to improve text understanding, generation, and analysis for Afrikaans speakers by providing robust linguistic capabilities.

  • Developed by: Mthombeni F
  • Model type: LLM
  • Language(s) (NLP): Afrikaans
  • Finetuned from model: BERT

Uses

The model is intended for a range of Afrikaans natural language processing (NLP) tasks. Foreseeable users include:

  • Researchers working with low-resource languages, particularly Afrikaans.
  • Businesses and developers creating Afrikaans-based applications, such as chatbots or content generation tools.
  • Educational institutions looking to use the model in Afrikaans language learning programs.
  • Content creators needing to generate or analyze text in Afrikaans.

Those affected by the model include:

  • Afrikaans speakers and communities, particularly those whose content may be analyzed or generated.
  • Users of applications that use this model for text processing.

Direct Use

AfriBert can be used out-of-the-box for tasks like:

  • Text generation, completing or generating Afrikaans text based on a prompt.
  • Question-answering (QA) in Afrikaans, where users provide a context and the model can answer questions based on the provided text.
  • Text classification or sentiment analysis without needing fine-tuning.

Direct users of AfriBert can leverage it for conversational agents, chatbots, or creative writing prompts; a prompt-based question-answering sketch is shown below.
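
One way to try question answering directly is to phrase the context and question as a single prompt for the text-generation pipeline, since AfriBert generates text from a prompt. The sketch below assumes this prompt-based setup; the placeholder model path is the same one used in the getting-started section, and the Afrikaans context and question are purely illustrative.

from transformers import pipeline

# Load the text-generation pipeline from a local AfriBert checkpoint
# (placeholder path, as in the "How to Get Started" section).
qa_generator = pipeline("text-generation", model="{path-to-your-afribert-model}")

# Illustrative Afrikaans context and question.
konteks = "Huisgenoot is 'n gewilde Afrikaanse tydskrif wat weekliks verskyn."
vraag = "Hoe gereeld verskyn Huisgenoot?"

# Phrase the QA task as a generation prompt and read off the continuation.
prompt = f"Konteks: {konteks}\nVraag: {vraag}\nAntwoord:"
antwoord = qa_generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(antwoord[0]["generated_text"])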

Downstream Use

When fine-tuned for specific tasks, AfriBert can be integrated into larger ecosystems or applications. Some potential downstream uses include:

  • Fine-tuning for sentiment analysis in Afrikaans for social media or news data.
  • Integration into translation systems for handling Afrikaans-specific linguistic nuances.
  • Customization for particular domains (legal, medical) where fine-tuned vocabulary is necessary.

In larger ecosystems, AfriBert can function as the backbone for Afrikaans NLP applications, such as intelligent assistants or educational platforms; a fine-tuning sketch for sentiment analysis is shown below.
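
As an illustration of the sentiment-analysis use case, the sketch below fine-tunes the checkpoint with a sequence-classification head via the Hugging Face Trainer. The CSV file names, label count, and output directory are placeholders, and the padding-token handling assumes a GPT-2-style tokenizer as used elsewhere in this card.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

model_path = "{path-to-your-afribert-model}"  # placeholder path from this card
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:  # GPT-2 style tokenizers ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical labelled data with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "sentiment_train.csv",
                                          "test": "sentiment_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="afribert-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()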

Out-of-Scope Use

AfriBert is not designed for, nor will it work well in, the following contexts:

  • Highly technical fields (e.g., medical, legal) without additional fine-tuning.
  • Use for misinformation, propaganda, or the generation of harmful or biased content.
  • In multi-lingual tasks beyond Afrikaans, as the model is specifically tuned for this language and may not generalize well to other South African languages.
  • Applications involving critical safety tasks or decision-making processes, such as autonomous systems.

Misuse of the model could lead to the propagation of biases found in the Huisgenoot training data or to the generation of inappropriate or out-of-context content.

Bias, Risks, and Limitations

While AfriBert performs well on tasks related to Afrikaans language processing, it has certain limitations:

  • It may struggle with highly technical or domain-specific vocabulary not present in the training data.
  • The model may reflect biases present in the training data from Huisgenoot.
  • Performance on Afrikaans dialects or less common language forms may not be optimal.

Recommendations

  • Bias Awareness: AfriBert may inherit biases present in the training corpus, which could affect how it generates or interprets Afrikaans text. Users should be aware of this, particularly when deploying the model in sensitive or public-facing environments.
  • Evaluation: The model should be evaluated carefully before deployment to ensure it aligns with the specific goals of the intended use case, especially in high-stakes scenarios.
  • Performance Monitoring: AfriBert may perform inconsistently across different Afrikaans dialects or informal language. Continuous evaluation and monitoring are recommended for such scenarios.

How to Get Started with the Model

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import pipeline
# Load the tokenizer and model
model_path = "{path-to-your-afribert-model}"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)
# Initialize the text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Generate Afrikaans text
prompt = "Ek het 'n droom gehad"
generated_text = text_generator(prompt, max_length=100, num_return_sequences=1)
print(generated_text[0]['generated_text'])

Training Details

AfriBert was trained on a corpus of Afrikaans text derived from Huisgenoot magazine articles, a popular publication covering a variety of topics. The dataset was preprocessed to clean and tokenize the text: unnecessary characters and stop words were removed, and the text was normalized for consistency. The model was fine-tuned on top of the GPT-2 architecture for tasks such as text generation and was evaluated on several text classification and question-answering tasks.

Training Data

The model was trained on a large corpus of articles from Huisgenoot, a popular Afrikaans magazine, covering a wide range of topics including culture, entertainment, and human-interest stories. This data was selected to provide broad coverage of both colloquial and formal Afrikaans.

Training Procedure

The training procedure included:

  • Loading and preprocessing the Huisgenoot dataset.
  • Tokenizing the corpus using GPT-2 tokenization.
  • Fine-tuning the GPT-2 model using a training regime that involved text generation and sequence prediction.
  • The model was trained for 3 epochs with a batch size of 4 per device, using standard optimization techniques; the data preparation for this setup is sketched after this list.
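
The sketch below illustrates the tokenization and sequence-prediction setup listed above: the cleaned Huisgenoot text is tokenized with the GPT-2 tokenizer and grouped into fixed-length blocks for causal language-model training. The corpus file name and block size are assumptions rather than documented values; the Trainer wiring for the fine-tuning step itself is sketched under Training Hyperparameters.

from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling
from datasets import load_dataset

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
block_size = 128  # assumed training sequence length

# Assumed layout: the cleaned corpus stored as one plain-text file.
raw = load_dataset("text", data_files={"train": "huisgenoot_clean.txt"})

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate all token ids and split them into block_size chunks.
    ids = sum(examples["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True,
                           remove_columns=tokenized["train"].column_names)

# For causal LM fine-tuning the collator copies input_ids into labels.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)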

Preprocessing

The text from Huisgenoot was preprocessed to remove special characters, numbers, and unnecessary whitespace. Additionally:

  • Text was normalized to lowercase.
  • Afrikaans stop words were removed to improve the quality of the text corpus.
  • Tokenization was performed using the GPT-2 tokenizer; a sketch of these preprocessing steps follows this list.
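
A hedged sketch of these cleaning steps is given below; the exact regular expression and the (abbreviated) Afrikaans stop-word list are assumptions, since only the broad steps are documented here.

import re

# Small illustrative stop-word set; a fuller Afrikaans list would be used in practice.
AFRIKAANS_STOPWORDS = {"die", "en", "van", "het", "in", "is", "nie", "dit", "te", "om"}

def preprocess(text):
    text = text.lower()                               # normalize to lowercase
    text = re.sub(r"[^a-zäëïöüéèê'\s]", " ", text)    # drop special characters and numbers
    text = re.sub(r"\s+", " ", text).strip()          # collapse unnecessary whitespace
    words = [w for w in text.split() if w not in AFRIKAANS_STOPWORDS]
    return " ".join(words)

print(preprocess("Die son skyn helder oor die 2024-vakansie in Kaapstad!"))
# -> "son skyn helder oor vakansie kaapstad"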

Training Hyperparameters

  • Training regime: fp32 precision.
  • Epochs: 3
  • Batch size: 4 per device
  • Learning rate: Default optimizer learning rate was used.
  • Optimizer: AdamW (see the TrainingArguments sketch after this list).
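
Assuming the Hugging Face Trainer was used (the output directory below is a placeholder), these settings map onto TrainingArguments roughly as follows; AdamW with its default learning rate is the Trainer default, and fp32 is what the Trainer uses when no mixed-precision flag is set.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="afribert-finetune",    # placeholder output directory
    num_train_epochs=3,                # Epochs: 3
    per_device_train_batch_size=4,     # Batch size: 4 per device
    # learning_rate is left at the Trainer default (AdamW); with no fp16/bf16
    # flag set, training runs in fp32 precision.
)

# training_args would then be passed to a Trainer together with the model,
# the block-wise LM dataset, and the data collator from the Training Procedure sketch.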

Evaluation

Evaluation was done using standard text generation, question-answering, and classification benchmarks in Afrikaans. The model's performance was compared to baseline models for accuracy, fluency, and coherence in generated text.

Testing Data, Factors & Metrics

Testing Data

Factors

Factors considered during evaluation include:

  • Performance on formal vs. informal Afrikaans text.
  • Handling of dialects or regional language differences.
  • Domain-specific performance for common Afrikaans topics.

Metrics

Evaluation metrics for AfriBert include:

  • Perplexity: Measures the fluency of text generation.
  • Accuracy: For question-answering tasks.
  • F1 Score: For named entity recognition (NER) and classification tasks.

Together, these metrics provide insight into how well the model performs across different language tasks; a short perplexity sketch is shown below.
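
For example, perplexity can be estimated as the exponential of the model's average cross-entropy loss on held-out Afrikaans text. The sketch below assumes the placeholder model path from this card and uses an illustrative sentence.

import math
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_path = "{path-to-your-afribert-model}"  # placeholder path
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)
model.eval()

text = "Die weer in Kaapstad is vandag sonnig en warm."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")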

Model Card Authors

FuluM21

Model Card Contact
