TURKCELL/bert-offensive-lang-detection-tr

 ---
 license: mit
 ---
+# Offensive Language Detection for Turkish
+## Model Description
+This model was fine-tuned from [dbmdz/bert-base-turkish-128k-uncased](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) on the Turkish [OffensEval 2020](https://huggingface.co/datasets/offenseval2020_tr) dataset.
+The offenseval-tr training set contains 31,756 annotated tweets, and the test set contains 3,528.
+## Dataset Distribution
+|           | Non-offensive (0) | Offensive (1) |
+|-----------|------------------|--------------|
+| Train     | 25625            | 6131         |
+| Test      | 2812             | 716          |
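The class balance implied by the table can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the counts above; the helper name `offensive_share` is illustrative and not part of the model's code.

```python
# Counts copied from the distribution table above.
counts = {
    "train": {"non_offensive": 25625, "offensive": 6131},
    "test": {"non_offensive": 2812, "offensive": 716},
}

def offensive_share(split):
    """Return (total tweets, fraction labelled offensive) for one split."""
    total = sum(split.values())
    return total, split["offensive"] / total

for name, split in counts.items():
    total, share = offensive_share(split)
    print(f"{name}: {total} tweets, {share:.1%} offensive")
# train: 31756 tweets, 19.3% offensive
# test: 3528 tweets, 20.3% offensive
```

Roughly one tweet in five is offensive in both splits, which is worth keeping in mind when reading the per-class metrics.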
+## Preprocessing Steps
+| Process                                          | Description                                       |
+|--------------------------------------------------|---------------------------------------------------|
+| Accented character transformation                | Converting accented characters to their unaccented equivalents |
+| Lowercase transformation                         | Converting all text to lowercase                  |
+| Removing @user mentions                          | Removing @user formatted user mentions from text  |
+| Removing hashtag expressions                     | Removing #hashtag formatted expressions from text |
+| Removing URLs                                    | Removing URLs from text                           |
+| Removing punctuation and text emoticons          | Removing punctuation marks and emoticons written with punctuation (such as `:)`) from text |
+| Removing emojis                                  | Removing emojis from text                         |
+| Deasciification                                  | Converting ASCII text into text containing Turkish characters |
+The effect of each preprocessing step on model performance was analyzed.
+Removing digits and keeping hashtags had no effect on the results.
+## Usage
+Install the required libraries (the inference code below also needs `transformers`, `torch`, and `numpy`):
+```bash
+pip install git+https://github.com/emres/turkish-deasciifier.git
+pip install keras_preprocessing
+```
+The preprocessing helper functions are:
+```python
+from turkish.deasciifier import Deasciifier
+
+def deasciifier(text):
+    """Restore Turkish characters in text that was typed with ASCII only."""
+    return Deasciifier(text).convert_to_turkish()
+
+def remove_circumflex(text):
+    """Replace circumflexed vowels with their plain equivalents."""
+    circumflex_map = {
+        'â': 'a',
+        'î': 'i',
+        'û': 'u',
+        'ô': 'o',
+        'Â': 'A',
+        'Î': 'I',
+        'Û': 'U',
+        'Ô': 'O'
+    }
+    return ''.join(circumflex_map.get(c, c) for c in text)
+
+def turkish_lower(text):
+    """Lowercase text using Turkish casing rules (I -> ı, İ -> i)."""
+    turkish_map = {
+        'I': 'ı',
+        'İ': 'i',
+        'Ç': 'ç',
+        'Ş': 'ş',
+        'Ğ': 'ğ',
+        'Ü': 'ü',
+        'Ö': 'ö'
+    }
+    return ''.join(turkish_map.get(c, c).lower() for c in text)
+```
+Clean the text with the function below:
+```python
+import re
+
+def clean_text(text):
+    # Replace circumflexed characters with their plain equivalents
+    text = remove_circumflex(text)
+    # Convert the text to lowercase using Turkish casing rules
+    text = turkish_lower(text)
+    # Restore Turkish characters in ASCII-only text
+    text = deasciifier(text)
+    # Remove user mentions
+    text = re.sub(r"@\S*", " ", text)
+    # Remove hashtags
+    text = re.sub(r'#\S+', ' ', text)
+    # Remove URLs
+    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
+    # Remove punctuation and text-based emoticons
+    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
+    # Remove emojis
+    emoji_pattern = re.compile("["
+                           u"\U0001F600-\U0001F64F"  # emoticons
+                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
+                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
+                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
+                           u"\U00002702-\U000027B0"
+                           u"\U000024C2-\U0001F251"
+                           "]+", flags=re.UNICODE)
+    text = emoji_pattern.sub(r' ', text)
+    # Collapse runs of whitespace into a single space
+    text = re.sub(r'\s+', ' ', text).strip()
+    return text
+```
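To see what the mention, hashtag, and URL steps do in isolation, the sketch below applies only those three substitutions plus the final whitespace collapse. `strip_social_tokens` is a hypothetical helper written for illustration; it skips the circumflex, lowercasing, deasciification, punctuation, and emoji steps, so its output is not identical to that of `clean_text`.

```python
import re

def strip_social_tokens(text):
    """Hypothetical helper: only the mention, hashtag and URL steps of clean_text."""
    text = re.sub(r"@\S*", " ", text)                      # user mentions
    text = re.sub(r"#\S+", " ", text)                      # hashtags
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)   # URLs
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

print(strip_social_tokens("@USER harika bir gün #istanbul https://t.co/abc"))
# prints: harika bir gün
```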
+## Model Initialization
+```python
+# Load model directly
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
+model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
+```
+Check whether a sentence is offensive as follows:
+```python
+import numpy as np
+
+def is_offensive(sentence):
+    # Map class indices to human-readable labels
+    d = {
+        0: 'non-offensive',
+        1: 'offensive'
+    }
+    normalized_text = clean_text(sentence)
+    test_sample = tokenizer([normalized_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
+    # Move the inputs to the same device as the model
+    test_sample = {k: v.to(model.device) for k, v in test_sample.items()}
+    output = model(**test_sample)
+    # Pick the class with the highest logit
+    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
+    print(normalized_text, "-->", d[y_pred[0]])
+    return y_pred[0]
+```
+```python
+is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
+is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
+```
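`is_offensive` returns only the arg-max label. When a confidence score or a different decision threshold is useful, the two logits can be converted to probabilities with a softmax. The following is a minimal sketch under stated assumptions: index 1 is the offensive class (as in the mapping `d`), the logit values are made up for illustration, and both `softmax` and `label_with_threshold` are hypothetical helpers, not part of the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, threshold=0.5):
    """Return (label, p_offensive); index 1 is assumed to be the offensive class."""
    p_offensive = softmax(logits)[1]
    label = "offensive" if p_offensive >= threshold else "non-offensive"
    return label, p_offensive

# Hypothetical logits for one sentence, e.g. output.logits[0].tolist()
example_logits = [1.2, 0.4]
print(label_with_threshold(example_logits))                 # default threshold
print(label_with_threshold(example_logits, threshold=0.3))  # stricter filtering
```

For these example logits the offensive probability is about 0.31, so the default threshold yields `non-offensive` while a threshold of 0.3 yields `offensive`. Any threshold other than 0.5 is a choice for the application and is not something the model card prescribes.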
+## Evaluation
+Evaluation results on the test set are shown in the table below.
+The model achieves 89% accuracy on the test set.
+## Model Performance Metrics
+| Class   | Precision | Recall | F1-score | Accuracy |
+|---------|-----------|--------|----------|----------|
+| Class 0 | 0.92      | 0.94   | 0.93     | 0.89     |
+| Class 1 | 0.73      | 0.67   | 0.70     |          |
+| Macro   | 0.83      | 0.80   | 0.81     |          |
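The figures in the table can be cross-checked against each other. The sketch below recomputes the per-class F1 scores from precision and recall, and recovers accuracy as the support-weighted average of recall. It assumes the evaluation used the test split listed under Dataset Distribution (2812 non-offensive, 716 offensive tweets); since the inputs are rounded to two decimals, the results are approximate.

```python
# Precision and recall from the table; support is the assumed test-set size per class.
metrics = {
    0: {"precision": 0.92, "recall": 0.94, "support": 2812},
    1: {"precision": 0.73, "recall": 0.67, "support": 716},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

per_class_f1 = {c: f1(m["precision"], m["recall"]) for c, m in metrics.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Accuracy equals correctly classified items over all items,
# i.e. the recall of each class weighted by its support.
total = sum(m["support"] for m in metrics.values())
accuracy = sum(m["recall"] * m["support"] for m in metrics.values()) / total

print({c: round(v, 2) for c, v in per_class_f1.items()})  # {0: 0.93, 1: 0.7}
print(round(macro_f1, 3), round(accuracy, 3))             # 0.814 0.885
```

The recomputed values agree with the reported F1 scores of 0.93 and 0.70, macro F1 of 0.81, and accuracy of 0.89.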