Fine-Tuned Hugging Face Language Identification Model

Languages Supported:

  1. English (en)
  2. French (fr)
  3. German (de)
  4. Russian (ru)
  5. Arabic (ar)

Metrics:

  1. F1-score
  2. Accuracy
  3. Precision
  4. Recall

Library:

Transformers

Overview

Language identification is a foundational task in Natural Language Processing (NLP). This project provides a language identification model fine-tuned from the multilingual XLM-RoBERTa architecture. It classifies text in five languages: English, French, German, Arabic, and Russian. The sections below describe the model, its training setup, the corpus used, and how to use and evaluate it.

Table of Contents

  1. Model Details
  2. Training
  3. Corpus Used
  4. Technology Stack
  5. Model Performance
  6. Usage
  7. Project File Structure
  8. Contributing

1. Model Details

  1. Model Architecture: The model architecture is based on XLM-RoBERTa, a multilingual variant of RoBERTa. This architecture is renowned for its contextual embeddings and multilingual capabilities.

  2. Number of Languages: The model is designed to identify text in five different languages, specifically English, French, German, Arabic, and Russian.

  3. Training Dataset: The training data consists of roughly 50,000 sentences per language (about 37,000 for Arabic), drawn from the corpus described in Corpus Used below. This breadth helps the model generalize across domains and writing styles.

  4. Evaluation Metrics: The primary evaluation metrics are accuracy and F1-score, which summarize the model's overall classification performance.
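
The exact metric computation lives in the project sources; as a rough sketch, a metrics function for the Hugging Face Trainer built on scikit-learn typically looks like the following (the function name and the weighted averaging are assumptions, not taken from this repository):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer during evaluation
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }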

2. Training

The model was fine-tuned with Hugging Face's Trainer class. The key training settings are listed below, and a consolidated sketch of the corresponding configuration follows the list:

  1. Number of Epochs: The model was trained for two epochs, a reasonable trade-off between training time and validation performance.

  2. Learning Rate: A learning rate of 2e-5 was used, a common choice for fine-tuning transformer models that balances convergence speed against training stability.

  3. Batch Size: A batch size of 64 was used during training. This size was chosen to ensure effective memory usage and minimize computational overhead.

  4. Evaluation Strategy: The evaluation strategy is set to be epoch-based, ensuring that the model is periodically assessed for performance improvements.

  5. Logging Steps: The logging steps are determined based on the size of the training dataset. This dynamic approach adapts to dataset variations, providing more informative logs during training.
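
Putting these settings together, a minimal sketch of the fine-tuning configuration might look like the following. Variable names such as tok_train, tok_valid, and compute_metrics, as well as the exact logging-step formula, are assumptions for illustration rather than the project's actual code:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_ckpt = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(base_ckpt, num_labels=5)  # en, fr, de, ru, ar

batch_size = 64
# Log roughly once per epoch's worth of batches (this formula is an assumption)
logging_steps = len(tok_train) // batch_size

training_args = TrainingArguments(
    output_dir="Fine_Tuned_HF_Language_Identification_Model",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",   # named eval_strategy in newer transformers releases
    logging_steps=logging_steps,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tok_train,            # tokenized training split (assumed name)
    eval_dataset=tok_valid,             # tokenized validation split (assumed name)
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,    # metrics function as sketched in Model Details
)
trainer.train()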

3. Corpus Used

The training corpus comes from the Leipzig Corpora Collection (© 2023 Universität Leipzig / Sächsische Akademie der Wissenschaften / InfAI).

Language    Corpus size (sentences)
English     50002
French      50002
German      50002
Russian     50002
Arabic      36888
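
The corpus files themselves are not bundled with this README. Purely as an illustration, per-language sentence files could be assembled into a labeled dataset with the Datasets library roughly as follows (the file names, label mapping, and split ratio are assumptions, not the project's actual preprocessing):

from datasets import Dataset, concatenate_datasets

# Hypothetical per-language sentence files; not the project's actual file names
files = {"en": "eng_sentences.txt", "fr": "fra_sentences.txt", "de": "deu_sentences.txt",
         "ru": "rus_sentences.txt", "ar": "ara_sentences.txt"}
label2id = {"en": 0, "fr": 1, "de": 2, "ru": 3, "ar": 4}

parts = []
for lang, path in files.items():
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    parts.append(Dataset.from_dict({"text": sentences,
                                    "label": [label2id[lang]] * len(sentences)}))

dataset = concatenate_datasets(parts).shuffle(seed=42)
dataset = dataset.train_test_split(test_size=0.2)  # split ratio is an assumption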

4. Technology Stack

  1. Python: Python is the primary programming language used for developing the language identification model and its associated tools. Python's simplicity, readability, and extensive libraries make it an ideal choice for Natural Language Processing (NLP) tasks.

  2. Hugging Face Transformers: Hugging Face Transformers is a fundamental component of the technology stack. It provides access to pre-trained models, libraries for model fine-tuning, and tokenization tools. The project relies heavily on this open-source library for model loading, fine-tuning, and evaluation.

  3. PyTorch: PyTorch is the deep learning framework chosen for this project. It provides the computational backbone for neural network training and inference. PyTorch's flexibility and dynamic computation graph make it a popular choice for developing NLP models.

  4. Transformers Library: The Transformers library within Hugging Face provides high-level abstractions for training and fine-tuning transformer models. It offers the Trainer class, which simplifies training, evaluation, and model saving.

  5. Datasets Library: The Datasets library is another valuable component from Hugging Face. It simplifies data handling, data loading, and data preprocessing, making it easy to work with large datasets efficiently.

  6. XLM-RoBERTa: XLM-RoBERTa serves as the base architecture for the language identification model. It is a multilingual variant of the RoBERTa model, which is pre-trained on a vast corpus of text from multiple languages. This architecture provides the foundation for the fine-tuning process.

  7. scikit-learn: scikit-learn is used for calculating evaluation metrics such as accuracy, F1-score, precision, and recall. It offers a wide range of machine learning tools, making it suitable for assessing model performance.

  8. Git and GitHub: Git and GitHub are used for version control and collaborative development. Git helps manage the codebase and track changes, while GitHub provides a platform for collaborative work and model distribution.

  9. Hugging Face Model Hub: The Hugging Face Model Hub is the platform where the fine-tuned language identification model is hosted. It allows users to easily access and use the model in their NLP projects (a quick pipeline example is sketched after this list).

  10. Google Colab Notebooks: Google Colab (Jupyter-style) notebooks were used for exploratory data analysis, code prototyping, and interactive documentation. They offer a convenient environment for experimenting with code and visualizing data.
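
Because the fine-tuned checkpoint is hosted on the Model Hub, the quickest way to try it is the transformers pipeline API. A minimal sketch (the checkpoint id mirrors the one used in Usage below and may need a user or organization prefix on the Hub):

from transformers import pipeline

classifier = pipeline("text-classification",
                      model="Fine_Tuned_HF_Language_Identification_Model")

print(classifier("Bonjour, comment allez-vous ?"))
# e.g. [{'label': 'fr', 'score': 0.99...}]  -- illustrative output; labels come from the model's config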

5. Model Performance

5.1 Overall Performance

Accuracy    F1-Score
0.9996      0.9996

5.2 Language Wise Performance

Language    Precision    Recall    F1-Score    Accuracy
English     1.0000       0.9994    0.9997      0.9994
French      1.0000       0.9992    0.9996      0.9992
German      1.0000       0.9998    0.9999      0.9998
Arabic      1.0000       0.9997    0.9999      0.9997
Russian     1.0000       1.0000    1.0000      1.0000
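
Per-language precision, recall, and F1 figures like these can be reproduced from test-set predictions with scikit-learn's classification_report; a minimal sketch, using tiny illustrative label ids so the snippet runs on its own:

from sklearn.metrics import classification_report

label_names = ["en", "fr", "de", "ru", "ar"]

# y_true / y_pred would normally be the test-set label ids and the model's predictions
y_true = [0, 1, 2, 3, 4, 0]
y_pred = [0, 1, 2, 3, 4, 0]
print(classification_report(y_true, y_pred, target_names=label_names, digits=4))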

6. Usage

To use this model for language identification, you can follow these steps:

  1. Install the necessary libraries and dependencies.
  2. Load the pre-trained model using the provided model checkpoint.
  3. Tokenize the input text using the model's tokenizer.
  4. Make predictions on the tokenized input to identify the language.

6.1 Installation

To use this language identification model, install the transformers and datasets libraries along with PyTorch, which the snippets below rely on.

pip install transformers datasets torch

6.2 Loading the Model

The model can be effortlessly loaded using the Hugging Face Transformers library. The following code demonstrates how to load the model and tokenizer:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_ckpt = "Fine_Tuned_HF_Language_Identification_Model"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)

6.3 Language Identification

Identifying the language of a given text is a straightforward process. The model utilizes a pre-trained tokenizer to prepare the text and a fine-tuned model to make the prediction. Here's a code snippet:

text = "Your input text goes here"
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
  outputs = model(**inputs)
predicted_language = model.config.id2label[torch.argmax(outputs.logits)]
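
To classify several texts at once, the tokenizer can pad a whole batch; a minimal sketch building on the snippet above:

texts = ["Hello, how are you?", "Bonjour tout le monde", "Wie geht es dir?"]
batch = tokenizer(texts, truncation=True, max_length=128, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
predicted = [model.config.id2label[i] for i in torch.argmax(logits, dim=-1).tolist()]
print(predicted)  # e.g. ['en', 'fr', 'de'] -- illustrative; label names come from the model's config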

6.4 Evaluating the Model

The model's performance can be evaluated on your dataset using the provided evaluation script. It calculates accuracy and F1-score, giving you a comprehensive understanding of how well the model classifies text.

# `trainer` is the Trainer instance from fine-tuning; `tok_test` is the tokenized test split
eval_result = trainer.evaluate(eval_dataset=tok_test)
accuracy = eval_result["eval_accuracy"]
f1 = eval_result["eval_f1"]
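
If you are evaluating the published checkpoint on your own data rather than inside the original training run, a self-contained sketch of the same evaluation might look like this. The column names and example rows are placeholders, and compute_metrics is the function sketched in Model Details:

from datasets import Dataset
from transformers import Trainer, TrainingArguments

def tokenize(batch):
    # "text" / "label" column names are assumptions for this sketch
    return tokenizer(batch["text"], truncation=True, max_length=128)

test_data = Dataset.from_dict({
    "text": ["Hello world", "Bonjour le monde"],  # illustrative examples only
    "label": [0, 1],                              # ids must follow model.config.label2id
})
tok_test = test_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval_out", per_device_eval_batch_size=64, report_to="none"),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
eval_result = trainer.evaluate(eval_dataset=tok_test)
print(eval_result["eval_accuracy"], eval_result["eval_f1"])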

7. Project File Structure

The project's file structure is organized as follows:

  • data/ : Contains datasets used for training and testing the model

  • src/ : Source code and Google Colab notebook

  • README.md : This README file

  • / : Model checkpoint files

8. Contributing

Contributions and suggestions are welcome. If you find issues or have ideas for improvements, please open an issue or submit a pull request.
