TURKCELL/bert-offensive-lang-detection-tr

Offensive Language Detection For Turkish Language

Model Description

This model was fine-tuned from the dbmdz/bert-base-turkish-128k-uncased model on the OffensEval 2020 Turkish dataset (offenseval-tr). The training split of offenseval-tr contains 31,756 annotated tweets.

Dataset Distribution

Split	Non-offensive (0)	Offensive (1)
Train	25625	6131
Test	2812	716
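
The table shows that both splits are imbalanced toward the non-offensive class. As a quick illustration, the short sketch below recomputes the split sizes and the offensive share from the counts in the table; the names `counts` and `offensive_share` are introduced here for illustration only.

```python
# Class counts copied from the dataset distribution table above.
counts = {
    "train": {"non_offensive": 25625, "offensive": 6131},
    "test": {"non_offensive": 2812, "offensive": 716},
}


def offensive_share(split):
    """Return (total tweets, fraction labelled offensive) for a split."""
    c = counts[split]
    total = c["non_offensive"] + c["offensive"]
    return total, c["offensive"] / total


for split in counts:
    total, share = offensive_share(split)
    print(f"{split}: {total} tweets, {share:.1%} offensive")
```

Roughly one tweet in five is offensive in each split, which is worth keeping in mind when reading the per-class metrics in the Evaluation section.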

Preprocessing Steps

Process	Description
Accented character transformation	Converting accented characters to their unaccented equivalents
Lowercase transformation	Converting all text to lowercase
Removing @user mentions	Removing @user formatted user mentions from text
Removing hashtag expressions	Removing #hashtag formatted expressions from text
Removing URLs	Removing URLs from text
Removing punctuation and punctuated emojis	Removing punctuation marks and emojis presented with punctuation from text
Removing emojis	Removing emojis from text
Deasciification	Converting ASCII text into text containing Turkish characters

The effect of each preprocessing step on model performance was analyzed individually. Removing digits and keeping hashtags had no effect on performance, so digits are kept and hashtags are removed.

Usage

Install necessary libraries:

pip install git+https://github.com/emres/turkish-deasciifier.git

pip install keras_preprocessing

The preprocessing helper functions are defined below:


from turkish.deasciifier import Deasciifier


def deasciifier(text):
    # Restore Turkish characters in text that was typed with ASCII letters only
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }

    return ''.join(circumflex_map.get(c, c) for c in text)


def turkish_lower(text):
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
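
A custom lowercase function is needed because Python's built-in `str.lower()` is not locale-aware: it maps `I` to `i` rather than to the dotless `ı`, and maps `İ` to `i` followed by a combining dot. The sketch below illustrates the difference. It is not the function used above but a compact alternative built on `str.translate`; the names `TR_I_MAP` and `turkish_lower_sketch` are introduced here for illustration only.

```python
# Only the dotted/dotless I pair needs special handling; str.lower()
# already maps Ç, Ş, Ğ, Ü and Ö to their lowercase forms.
TR_I_MAP = str.maketrans({"I": "ı", "İ": "i"})


def turkish_lower_sketch(text):
    """Lowercase text using the Turkish rules for I and İ."""
    return text.translate(TR_I_MAP).lower()


word = "IŞIK İLE"
print(repr(word.lower()))                # built-in, locale-unaware lowercasing
print(repr(turkish_lower_sketch(word)))  # Turkish-aware lowercasing
```

Printing with `repr` makes the stray combining dot produced by the built-in method visible.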

Clean the text with the function below:

import re

def clean_text(text):
    # Replace circumflexed letters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase
    text = turkish_lower(text)
    # Restore Turkish characters in ASCII-only text
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and text-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
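
To see what the regex steps do without installing the deasciifier, the following dependency-free sketch applies only the mention, hashtag, URL, punctuation and whitespace steps to a made-up sample tweet. It is a simplified subset of `clean_text`, not a replacement for it: lowercasing, circumflex handling, deasciification and emoji removal are skipped, and only the plain punctuation pattern is used. The function name `strip_social_tokens` and the sample text are assumptions for illustration.

```python
import re


def strip_social_tokens(text):
    """Apply a subset of the clean_text regex steps (illustration only)."""
    text = re.sub(r"@\S*", " ", text)                     # user mentions
    text = re.sub(r"#\S+", " ", text)                     # hashtags
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]", " ", text)                  # punctuation
    return re.sub(r"\s+", " ", text).strip()              # extra whitespace


sample = "@USER harika bir gün! https://t.co/abc #mutlu"
print(strip_social_tokens(sample))  # harika bir gün
```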

Model Initialization

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

Check whether a sentence is offensive as follows:

import numpy as np
def is_offensive(sentence):
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

    # Move the inputs to the device the model is on
    test_sample = {k: v.to(model.device) for k, v in test_sample.items()}

    output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]

is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
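
`is_offensive` returns only the predicted class. If a confidence score is also useful, the two logits can be turned into probabilities with a softmax. The sketch below shows the idea in plain Python on hypothetical logit values; `softmax`, `label_with_confidence` and the example logits are introduced here for illustration and are not part of the model's API.

```python
import math

LABELS = {0: "non-offensive", 1: "offensive"}


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Hypothetical logits for a single sentence: [non-offensive, offensive]
print(label_with_confidence([2.0, -1.0]))
```

With the real model, the logits for one sentence could be obtained inside `is_offensive` as `output.logits[0].detach().tolist()` and passed to `label_with_confidence`.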

Evaluation

Evaluation results on the test set are shown in the table below. The model achieves 89% accuracy on the test set.

Model Performance Metrics

Class	Precision	Recall	F1-score	Accuracy
Class 0	0.92	0.94	0.93	0.89
Class 1	0.73	0.67	0.70
Macro	0.83	0.80	0.81
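
As a sanity check on how the columns relate, the sketch below recomputes the F1 scores, the macro F1 and the accuracy from the reported precision and recall, using the test-set class counts from the dataset distribution table as supports. Because the inputs are already rounded to two decimals, the results are approximate; the names `reported` and `f1` are introduced here for illustration only.

```python
# Reported per-class precision and recall, with test-set supports
# taken from the dataset distribution table.
reported = {
    0: {"precision": 0.92, "recall": 0.94, "support": 2812},
    1: {"precision": 0.73, "recall": 0.67, "support": 716},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {c: f1(m["precision"], m["recall"]) for c, m in reported.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Accuracy equals the support-weighted average of per-class recall.
total = sum(m["support"] for m in reported.values())
accuracy = sum(m["recall"] * m["support"] for m in reported.values()) / total

print({c: round(v, 2) for c, v in f1_scores.items()})
print(round(macro_f1, 2), round(accuracy, 2))
```

The recomputed values agree with the table to two decimals, which suggests the Macro row is the unweighted mean of the two classes.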
