BERT-based Organization Detection Model for Turkish Texts
Model Description
This model is fine-tuned from dbmdz/bert-base-turkish-uncased to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's efforts to analyze organizational presence in social media data.
Model Architecture
- Base Model: BERT (dbmdz/bert-base-turkish-uncased)
- Training Data: Twitter data from 8,000 accounts in total: 4,000 sampled at random and 4,000 with high organization-related activity (m3inference organization scores above 0.7). Accounts were annotated as organization or non-organization based on user names, screen names, and profile descriptions using ChatGPT (GPT-4).
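The m3inference pre-filter could look roughly like the sketch below; the input file name is hypothetical, and the `is-org` output key follows m3inference's documented output format. This is a sketch of the selection step, not the project's actual pipeline.

from m3inference import M3Inference

m3 = M3Inference()  # downloads the pretrained M3 model on first use
# "accounts.jsonl" is a hypothetical file of profile records in M3's input format.
preds = m3.infer("accounts.jsonl")

# Keep accounts whose organization probability exceeds the 0.7 cutoff.
org_candidates = [uid for uid, scores in preds.items()
                  if scores["org"]["is-org"] > 0.7]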
Training Setup
- Tokenization: Hugging Face's AutoTokenizer, padding and truncating sequences to a maximum length of 128 tokens (see the sketch after this list).
- Dataset Split: 80% training, 20% validation.
- Training Parameters:
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
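A minimal sketch of this setup with the Hugging Face Trainer is shown below. The `train_ds` and `val_ds` dataset variables are placeholders for the 80/20 split of the annotated data; the tokenizer settings and training arguments mirror the values listed above.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2)

def tokenize(batch):
    # Pad/truncate profile texts to 128 tokens, as described above.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=128)

# `train_ds` and `val_ds` stand in for the 80/20 split of the annotated data
# (e.g. datasets.Dataset objects with "text" and "label" columns).
train_tok = train_ds.map(tokenize, batched=True)
val_tok = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="org-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)
Trainer(model=model, args=args,
        train_dataset=train_tok, eval_dataset=val_tok).train()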
Hyperparameter Tuning
Hyperparameter tuning was performed with Optuna; the best trial used:
- Learning rate: 1.84e-05
- Batch size: 16
- Epochs: 3
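One way to run such a search is the Trainer's built-in Optuna backend, sketched below. The search ranges are illustrative assumptions, not the ones actually used; `args`, `train_tok`, and `val_tok` are as in the training sketch above.

from transformers import AutoModelForSequenceClassification, Trainer

def model_init():
    # A fresh model is instantiated for each Optuna trial.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2)

def hp_space(trial):
    # Illustrative ranges; the actual search space was not published.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

trainer = Trainer(model_init=model_init, args=args,
                  train_dataset=train_tok, eval_dataset=val_tok)
best = trainer.hyperparameter_search(direction="minimize", backend="optuna",
                                     hp_space=hp_space, n_trials=20)
print(best.hyperparameters)  # best trial here: lr=1.84e-05, batch size 16, 3 epochs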
Evaluation Metrics
- Precision on Validation Set: 0.67 (organization class)
- Recall on Validation Set: 0.81 (organization class)
- F1-Score (organization class): 0.73
- Accuracy: 0.94
- Confusion Matrix on Validation Set: [[1390, 60], [28, 122]]
- Hand-Coded Sample of 1,000 Accounts:
  - Precision: 0.89
  - Recall: 0.89
  - F1-Score (organization class): 0.89
  - Confusion Matrix: [[935, 4], [4, 31]]
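These figures are consistent with standard scikit-learn metrics. The sketch below shows how they could be computed from validation predictions; `y_true` and `y_pred` are placeholder arrays, and treating label 1 as the organization class is an assumed convention.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# `y_true` and `y_pred` stand in for the validation labels and the model's
# argmax predictions; label 1 = organization is an assumed convention.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"org precision={prec:.2f}, recall={rec:.2f}, f1={f1:.2f}")
print("accuracy:", round(accuracy_score(y_true, y_pred), 2))
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted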
How to Use
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier")
tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier")
model.eval()  # inference mode: disables dropout

text = "Örnek metin buraya girilir."  # "Example text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)  # predicted class index
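Which index corresponds to the organization class depends on the saved config. The snippet below reads the mapping from the model; note that a generic LABEL_0/LABEL_1 mapping may be all that is stored, and whether index 1 means "organization" is an assumption to verify.

# Map the predicted index to a class name; verify the meaning of each
# index against the model's config before relying on it.
label = model.config.id2label[predictions.item()]
print(label)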