---
language: "tr"
tags:
- "bert"
- "turkish"
- "text-classification"
license: "apache-2.0"
datasets:
- "custom"
metrics:
- "precision"
- "recall"
- "f1"
- "accuracy"
---

# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's effort to analyze organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (`dbmdz/bert-base-turkish-uncased`)
- **Training Data:** Twitter data from 8,000 accounts in total: 4,000 sampled at random and 4,000 with high organization-related activity (m3inference organization scores above 0.7). Accounts were annotated based on user names, screen names, and profile descriptions using ChatGPT-4.

## Training Setup

- **Tokenization:** Hugging Face's `AutoTokenizer`, with sequences padded to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters:**
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01

A minimal training sketch is given in the appendix at the end of this card.

## Hyperparameter Tuning

Tuning was performed with Optuna; the best settings found were:

- **Learning rate:** 1.84e-05
- **Batch size:** 16
- **Epochs:** 3

A tuning sketch is given in the appendix at the end of this card.

## Evaluation Metrics

- **Precision on Validation Set:** 0.67 (organization class)
- **Recall on Validation Set:** 0.81 (organization class)
- **F1-Score:** 0.73 (organization class)
- **Accuracy:** 0.94
- **Confusion Matrix on Validation Set:**

```
[[1390,   60],
 [  28,  122]]
```

- **Hand-coded Sample of 100 Accounts:**
  - **Precision:** 0.89
  - **Recall:** 0.89
  - **F1-Score:** 0.89 (organization class)
  - **Confusion Matrix:**

```
[[935,   4],
 [  4,  31]]
```

A sketch for recomputing these metrics is given in the appendix at the end of this card.

## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier")
tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier")
model.eval()

text = "Örnek metin buraya girilir."  # "Example text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()
```
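The card does not record the label mapping, so check `model.config.id2label` on the downloaded model. Continuing from the example above, and assuming index 1 corresponds to the organization class, a probability score can be obtained with a softmax:

```python
import torch.nn.functional as F

# Assumption: index 1 is the "organization" class; verify against
# model.config.id2label before relying on this mapping.
probs = F.softmax(outputs.logits, dim=-1)
org_prob = probs[0, 1].item()
print(f"P(organization) = {org_prob:.3f}")
```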
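## Appendix: Reproduction Sketches

The training recipe described above can be approximated with Hugging Face's `Trainer` API. The sketch below is an assumption-laden reconstruction, not the project's actual code: the DataFrame `df` and its `text` and `label` columns are hypothetical stand-ins for the annotated account data.

```python
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")

def tokenize(batch):
    # Pad/truncate to the 128-token maximum described in the training setup.
    return tokenizer(batch["text"], padding="max_length", truncation=True,
                     max_length=128)

# `df` is a hypothetical DataFrame with "text" (user name, screen name, and
# description) and "label" (0 = individual, 1 = organization) columns.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2
)

# Training parameters as reported in the card.
args = TrainingArguments(
    output_dir="org_classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```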
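The Optuna search could be run through `Trainer.hyperparameter_search` with its Optuna backend; whether the project used this integration or a hand-rolled Optuna study is not stated in the card, and the search ranges below are illustrative assumptions. Only the best values found (learning rate 1.84e-05, batch size 16, 3 epochs) are reported above.

```python
def model_init():
    # A fresh model per trial, as the search API requires.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2
    )

def hp_space(trial):
    # Illustrative ranges only; the actual search space is not documented.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

search_trainer = Trainer(model_init=model_init, args=args,
                         train_dataset=train_ds, eval_dataset=val_ds)
best_run = search_trainer.hyperparameter_search(
    direction="minimize", backend="optuna", hp_space=hp_space, n_trials=20
)
print(best_run.hyperparameters)
```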
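Finally, the validation metrics reported above can be recomputed from the trained model's predictions. A sketch with scikit-learn, reusing `trainer` and `val_ds` from the training sketch and again assuming label 1 is the organization class:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

pred_output = trainer.predict(val_ds)
preds = np.argmax(pred_output.predictions, axis=-1)
labels = pred_output.label_ids

# average="binary" with pos_label=1 reports the organization-class scores,
# matching how precision and recall are presented in this card.
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, preds, average="binary", pos_label=1
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy_score(labels, preds):.2f}")
print(confusion_matrix(labels, preds))
```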