---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- mbert
model_name: mBERT-Algerian-Darija
base_model: bert-base-multilingual-cased
---

# mBERT — Algerian Darija Misinformation Detection

Fine-tuned **BERT-base-multilingual-cased** for detecting misinformation in **Algerian Darija** (Algerian Arabic) text.

- **Base model**: `bert-base-multilingual-cased` (170M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 75.42%
- **Macro F1**: 64.48%
- **Weighted F1**: 75.70%

**Per-class F1**:
- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%

Satire is by far the weakest class, which also explains the gap between the macro F1 (all classes counted equally) and the weighted F1 (dominated by the larger, better-classified classes).

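These are the standard scikit-learn metrics. A minimal sketch of how to reproduce them, where the `y_true` / `y_pred` arrays are made-up stand-ins for the real test-set class ids:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up stand-ins; in practice these are the 3,344 gold labels and
# model predictions for the test set, encoded as class ids 0..4.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 0, 4, 1])

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.2%}")
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.2%}")     # unweighted mean over classes
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2%}")  # support-weighted mean
print(classification_report(y_true, y_pred, target_names=["F", "R", "N", "M", "S"]))
```
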
---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (with early stopping)
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Loss**: Weighted CrossEntropy (see the sketch below)
- **Seed**: 42 (for reproducibility)

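The card stops at this summary, so the script below is a minimal sketch of how the same configuration maps onto the `transformers` `Trainer` API, not the original training code. The class-weight values, the early-stopping patience, and the tiny stand-in dataset are assumptions; only the hyperparameters listed above come from this card.

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
    set_seed,
)

set_seed(42)  # seed reported on this card

class WeightedLossTrainer(Trainer):
    """Trainer variant that swaps the default loss for class-weighted CrossEntropy."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=5
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Tiny dummy corpus standing in for the real (undisclosed) training data.
raw = Dataset.from_dict({"text": ["مثال", "exemple"] * 8, "label": [0, 1] * 8})
ds = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mbert-darija",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="epoch",        # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    seed=42,
)

trainer = WeightedLossTrainer(
    class_weights=torch.tensor([0.5, 0.7, 0.8, 1.5, 4.0]),  # assumed values, e.g. inverse class frequency
    model=model,
    args=args,
    train_dataset=ds,
    eval_dataset=ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # patience is an assumption
)
trainer.train()
```
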
---

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

# Class ids follow the order used during training: F, R, N, M, S.
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    # Darija: "They say they're going to cancel the bac (exam) this year."
    "قالك بلي رايحين ينحو الباك هذا العام",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)[0]
        pred_id = probs.argmax().item()
        confidence = probs[pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}\n")
```