Turkish Diacritization

This project introduces Turkish language processing, particularly for social media text, by studying diacritization and developing techniques for restoring missing diacritical marks in text.

Path Design

.
├── docs
│   ├── Project Proposal.pdf
├── ner
│   ├── new_df.csv
│   ├── process_data.ipynb
│   ├── named-entity-recognition.ipynb
│   ├── getting_B.py
├── plots
├── tools
│   ├── data_utils.py
├── test
│   ├── test-turkish-t5.ipynb
├── train
│   ├── llm-fine-tune-t5-transformer.ipynb
│   ├── llm-fine-tune.ipynb
├── README.md

Dataset

You can access the original training dataset from the link.

You can access the test dataset from the link.

We generated negative sentences from the original sentences by randomly mapping some letters to other letters. We used these negative sentences to build an augmented dataset, which we then used to train our model.

The character mapping is as follows:

character_mapping = {
    'ı': 'i',
    'i': 'ı',
    'u': 'ü',
    'ü': 'u',
    'o': 'ö',
    'ö': 'o',
    'ç': 'c',
    'c': 'ç',
    'ğ': 'g',
    'g': 'ğ',
    's': 'ş',
    'ş': 's'
}
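
As a minimal sketch of how negative sentences could be produced from this mapping: the function name `corrupt_sentence`, the `swap_prob` parameter, and the per-character sampling strategy are assumptions made for illustration, not the project's actual augmentation code.

```python
import random

# Mapping copied from the section above: each letter maps to its counterpart.
CHARACTER_MAPPING = {
    'ı': 'i', 'i': 'ı',
    'u': 'ü', 'ü': 'u',
    'o': 'ö', 'ö': 'o',
    'ç': 'c', 'c': 'ç',
    'ğ': 'g', 'g': 'ğ',
    's': 'ş', 'ş': 's',
}


def corrupt_sentence(sentence, swap_prob=0.3, rng=None):
    """Return a copy of `sentence` where each mappable letter is swapped
    with probability `swap_prob` (hypothetical parameter)."""
    if rng is None:
        rng = random.Random()
    chars = []
    for ch in sentence:
        if ch in CHARACTER_MAPPING and rng.random() < swap_prob:
            chars.append(CHARACTER_MAPPING[ch])
        else:
            chars.append(ch)
    return "".join(chars)


original = "çocuk okula gidiyor"
noisy = corrupt_sentence(original, swap_prob=0.5, rng=random.Random(0))
# Each (noisy, original) pair can serve as one augmented training example.
print(noisy, "->", original)
```

Passing a seeded `random.Random` keeps the augmentation reproducible. Because the mapping is one-to-one, the noisy sentence always has the same length as the original.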

You can access the augmented dataset from the link.

NER

Why NER?

After the diacritization step, we inspected the results and saw that our model does not handle capital letters. We therefore decided to add an NER layer on top of our transformer model. The plan was to use a BiLSTM-CRF NER model to detect named entities and to use this information to improve our diacritization model.

Dataset

First, we downloaded two Kaggle datasets and processed them into a format suitable for the BiLSTM-CRF NER model. You can find the processed NER dataset from the link.

However, the BiLSTM-CRF model that we trained did not work well, so we switched to a pretrained BERT model.

NER Model

The model is a Turkish BERT classification model trained on a Turkish NER dataset. We used it to detect named entities in our text.
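
The README does not spell out how the NER output is combined with the diacritized text. One plausible approach, sketched below, is to capitalize every word that the NER model tags as part of an entity. The function names, the one-tag-per-word input format, and the `B-PER`/`B-LOC`/`O` tag names are assumptions for illustration only.

```python
def turkish_capitalize(word):
    """Capitalize the first letter using Turkish casing rules
    (i -> İ, ı -> I), which Python's str.upper() does not apply."""
    if not word:
        return word
    first = word[0]
    if first == "i":
        upper = "İ"
    elif first == "ı":
        upper = "I"
    else:
        upper = first.upper()
    return upper + word[1:]


def apply_entity_capitalization(sentence, tags):
    """Capitalize every word whose tag is not "O".

    `tags` is a hypothetical per-word tag list such as an NER model
    might produce after aligning its tokens back to whitespace words.
    """
    words = sentence.split()
    if len(words) != len(tags):
        raise ValueError("expected exactly one tag per word")
    fixed = [
        turkish_capitalize(word) if tag != "O" else word
        for word, tag in zip(words, tags)
    ]
    return " ".join(fixed)


diacritized = "ahmet dün istanbul'a gitti"
tags = ["B-PER", "O", "B-LOC", "O"]  # assumed NER output
restored = apply_entity_capitalization(diacritized, tags)
print(restored)  # Ahmet dün İstanbul'a gitti
```

Handling the dotted and dotless i explicitly matters here, because a plain `upper()` call would turn "istanbul" into "Istanbul" rather than "İstanbul".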

Model

In this project we tried two different transformer setups: causal LM with BERT and seq2seq with T5. We decided to continue with the T5 model because it gave better results.

BERT Model

We used a BERT model for causal language modeling. We designed our dataset for this task and fine-tuned a pretrained BERT model. You can find the model from the link.

T5 Model

We used a T5 model for the seq2seq task. We designed our dataset for this task and fine-tuned a pretrained T5 model. You can find the model from the link. Our resulting T5 model is on Kaggle, where you can download two versions of it from the link.
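
For a seq2seq setup, each training example pairs an input without reliable diacritics with the correctly diacritized target. The sketch below shows one way such pairs could be built. The `input`/`target` field names and the choice to fold every Turkish letter to its ASCII counterpart are assumptions, not the project's confirmed data format.

```python
# Fold Turkish-specific letters to their ASCII look-alikes.
ASCII_FOLD = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def make_seq2seq_pair(sentence):
    """Build one (input, target) example: the model reads the folded
    text and learns to emit the original, diacritized sentence."""
    return {
        "input": sentence.translate(ASCII_FOLD),  # hypothetical field name
        "target": sentence,                       # hypothetical field name
    }


pair = make_seq2seq_pair("Çağrı öğle yemeği yedi")
print(pair["input"])   # Cagri ogle yemegi yedi
print(pair["target"])  # Çağrı öğle yemeği yedi
```

Since the fold is a character-for-character substitution, input and target always have the same length, which makes it easy to spot where a prediction diverges.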

V1.0

This version of the model was half-trained, using 1 million samples and without the missing tokens. The model works well, but its results show some issues caused by the missing tokens.

V2.0

This version of the model was trained with 2 million samples and with the missing tokens. It works very well, but it needs the NER model to improve its performance.

Training Arguments

We used the following training arguments for our model.

import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=25,
    num_train_epochs=1,
    warmup_steps=50,
    weight_decay=0.01,
    learning_rate=2e-3,
    save_strategy='steps',
    save_steps=10000,  # write a checkpoint every 10,000 steps
    logging_steps=10,
    output_dir="/kaggle/working/turkish2",
    lr_scheduler_type="cosine",
)
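
To give a feel for what `warmup_steps=50` combined with `lr_scheduler_type="cosine"` does to the learning rate, the sketch below recomputes the schedule in plain Python: a linear warmup followed by half a cosine cycle down to zero. The `total_steps` value of 80,000 is an assumption (2 million samples at batch size 25 on a single device, with no gradient accumulation); the real number depends on the training setup.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-3, warmup_steps=50):
    """Learning rate at `step` for linear warmup plus cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine cycle: base_lr at progress 0, zero at progress 1.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 2_000_000 // 25  # assumed: one epoch, single device
for step in (0, 25, 50, 40_025, 80_000):
    print(step, cosine_lr(step, total_steps))
```

The rate peaks at `2e-3` right after warmup, falls to about half of that at the midpoint of training, and reaches zero at the final step.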

Training Plots

The loss and learning rate plots for the T5 model are given below.

Model Evaluation

We evaluated our model on the provided test dataset. During testing, we also added the NER model to our pipeline: it detects named entities in the text, and we use this information to improve the diacritization output. The evaluation function is given as follows:

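def acc_overall(test_result, testgold):
  """Return word-level accuracy of test_result against testgold."""
  correct = 0
  total = 0
  # Count the words that match the gold sentence at the same position.
  for pred_sentence, gold_sentence in zip(test_result, testgold):
    pred_words = pred_sentence.split()
    gold_words = gold_sentence.split()
    for m, gold_word in enumerate(gold_words):
      # A missing predicted word counts as an error instead of raising IndexError.
      if m < len(pred_words) and pred_words[m] == gold_word:
        correct += 1
      total += 1

  return correct / total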
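
To make the metric concrete, here is a small worked example on made-up sentences. The helper `word_accuracy` is a self-contained stand-in written for this illustration, and the two sentence pairs are invented, not taken from the test set.

```python
def word_accuracy(predictions, references):
    """Fraction of reference words reproduced exactly at the same position."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        pred_words = pred.split()
        for i, ref_word in enumerate(ref.split()):
            if i < len(pred_words) and pred_words[i] == ref_word:
                correct += 1
            total += 1
    return correct / total


references = ["bugün hava çok güzel", "İstanbul'a gidiyorum"]
predictions = ["bugün hava cok güzel", "istanbul'a gidiyorum"]

score = word_accuracy(predictions, references)
print(f"{score:.4f}")  # 4 of 6 words match
```

The comparison is exact and case-sensitive, so the lowercase "istanbul'a" counts as an error just like the undiacritized "cok". This is why restoring capitalization with the NER model affects the reported accuracy.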

Our model's accuracy on the test dataset is 94.03%, so we can say that it works well.
