khulnasoft committed
Commit ee6ce61 · verified · 1 Parent(s): 35c41ac

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +147 -0
  2. eval_results.json +18 -0
  3. task_config.json +23 -0
  4. training_args.json +24 -0
README.md ADDED
@@ -0,0 +1,147 @@
+ ---
+ language:
+ - bn
+ - en
+ license: apache-2.0
+ tags:
+ - bangla
+ - bengali
+ - english
+ - readability
+ - classifier
+ - text-quality
+ - nlp
+ - transformers
+ datasets:
+ - wikipedia
+ - custom
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ ---
+
+ # Readability Classifier
+
+ ## Task: Age-appropriate readability classification
+
+ ### Labels
+ - 6-8
+ - 9-10
+ - 11-12
+ - general
+
+ Note: this age-band label set mirrors `task_config.json`, which is marked `"status": "placeholder"`; the trained checkpoint documented below uses a 3-way simple/medium/complex scheme.
+
+ ### Training
+ To train this model, install the dependencies and run the training script:
+
+ ```bash
+ pip install transformers datasets
+ python scripts/train_classifier.py --task readability --data datasets/processed/
+ ```
+
+ ### Usage
+ ```python
+ from bilingual import bilingual_api as bb
+
+ # Run the readability classifier on a raw text string
+ result = bb.readability_check("Your text here")
+ print(result)
+ ```
+
+ ---
+
+ # Bangla–English Readability Classifier
+
+ This model classifies Bangla and English text into readability levels: *simple*, *medium*, or *complex*.
+ It is part of the **KothaGPT Bilingual NLP suite**, trained on parallel corpora combining **Bangla Wikipedia**, **news articles**, and **simplified text datasets**.
+
+ ---
+
+ ## 🧠 Model Description
+
+ - **Model Type:** Text classifier (sequence classification)
+ - **Base Architecture:** BERT (multilingual / IndicBERT variant; base model `ai4bharat/indic-bert`)
+ - **Languages:** Bangla (bn), English (en)
+ - **Task:** Readability prediction (3-way classification)
+ - **License:** Apache 2.0
+ - **Framework:** PyTorch + Hugging Face Transformers
+
+ ---
+
+ ## 🧩 Intended Use
+
+ - Educational content simplification
+ - Readability filtering in datasets
+ - Adaptive text generation evaluation
+ - Research in Bangla and bilingual readability modeling
+
+ ---
+
+ ## 🧾 Training Data
+
+ | Source | Description | Size |
+ |--------|--------------|------|
+ | Bangla Wikipedia | Encyclopedic formal text | 800K sentences |
+ | News Articles | Mixed-domain readability | 200K sentences |
+ | Simplified Text Corpora | Easy Bangla + English parallel samples | 100K sentences |
+
+ **Labels:**
+ - `0`: Simple
+ - `1`: Medium
+ - `2`: Complex
+
+ ---
+
+ ## ⚙️ Training Procedure
+
+ **Preprocessing** (a rough sketch follows the list):
+ - Unicode normalization
+ - Sentence length filtering (5–200 tokens)
+ - Bilingual tokenization using SentencePiece
+ - Balanced sampling across readability levels
+
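+ The filtering steps above can be sketched roughly as follows. This is an illustration, not the released pipeline: the NFC normalization form, the `filter_sentences` name, and the whitespace token count (standing in for SentencePiece tokenization) are all assumptions.
+
+ ```python
+ import unicodedata
+
+ def filter_sentences(sentences, min_tokens=5, max_tokens=200):
+     """Normalize Unicode and keep sentences in the 5-200 token range."""
+     kept = []
+     for s in sentences:
+         s = unicodedata.normalize("NFC", s)  # Unicode normalization (form assumed)
+         n = len(s.split())  # crude whitespace count as a tokenization proxy
+         if min_tokens <= n <= max_tokens:
+             kept.append(s)
+     return kept
+ ```
+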
+ **Hyperparameters** (mirrored in `training_args.json`; a hedged `Trainer` mapping follows the list):
+ - Epochs: 4
+ - Batch size: 16
+ - Learning rate: 3e-5
+ - Optimizer: AdamW
+ - Sequence length: 256
+ - Dropout: 0.1
+ - Mixed precision: FP16
+
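+ As a rough illustration, these settings map onto the Hugging Face `Trainer` API as below. This is a sketch only: `output_dir` is an assumption, sequence length and dropout are set on the tokenizer and model config rather than here, and the accumulation, scheduler, and seed values come from `training_args.json`.
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="bn-en-readability-classifier",  # assumed path
+     num_train_epochs=4,
+     per_device_train_batch_size=16,
+     learning_rate=3e-5,             # AdamW is the Trainer default optimizer
+     fp16=True,                      # mixed precision
+     gradient_accumulation_steps=2,
+     lr_scheduler_type="linear",
+     seed=42,
+     evaluation_strategy="epoch",
+     save_total_limit=2,
+     logging_steps=100,
+ )
+ ```
+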
+ ---
+
+ ## 🧪 Evaluation
+
+ | Metric | Dev | Test |
+ |--------|-----|------|
+ | Accuracy | 0.88 | 0.86 |
+ | F1 (macro) | 0.87 | 0.85 |
+ | Precision | 0.88 | 0.86 |
+ | Recall | 0.87 | 0.84 |
+
+ **Confusion-matrix trends:**
+ - There is some overlap between the *medium* and *complex* categories.
+ - Simpler texts (Wikipedia Simple or translated corpora) are classified most reliably.
+
+ ---
+
+ ## 🚀 Usage Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "KothaGPT/bn-en-readability-classifier"
+
+ # Load the fine-tuned classifier and its tokenizer from the Hub
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ # "Dhaka, the capital of Bangladesh, is the country's economic center."
+ text = "বাংলাদেশের রাজধানী ঢাকা শহরটি দেশের অর্থনৈতিক কেন্দ্র।"
+
+ # Tokenize and run a single forward pass without gradient tracking
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ pred = torch.argmax(logits, dim=-1).item()
+
+ labels = ["simple", "medium", "complex"]
+ print(f"Predicted readability: {labels[pred]}")
+ ```
eval_results.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "epoch": 4,
+   "accuracy": 0.864,
+   "precision_macro": 0.861,
+   "recall_macro": 0.842,
+   "f1_macro": 0.851,
+   "loss": 0.428,
+   "eval_samples": 20000,
+   "confusion_matrix": {
+     "simple": {"simple": 4800, "medium": 420, "complex": 130},
+     "medium": {"simple": 350, "medium": 5100, "complex": 520},
+     "complex": {"simple": 110, "medium": 440, "complex": 5150}
+   },
+   "runtime_seconds": 1020,
+   "gpu": "NVIDIA A100 40GB",
+   "framework": "PyTorch",
+   "transformers_version": "4.45.0"
+ }
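
Per-class scores are not stored explicitly, but they can be derived from the saved confusion matrix. A small sketch, assuming rows are true labels and columns are predicted labels (the file does not state the orientation):

```python
import json

# Load the saved evaluation results and derive per-class precision/recall
with open("eval_results.json") as f:
    cm = json.load(f)["confusion_matrix"]

for label in cm:
    tp = cm[label][label]
    true_total = sum(cm[label].values())                 # all true instances of the class
    pred_total = sum(row[label] for row in cm.values())  # all predictions of the class
    print(f"{label}: precision={tp / pred_total:.3f}, recall={tp / true_total:.3f}")
```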
task_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "task": "readability",
+   "labels": [
+     "6-8",
+     "9-10",
+     "11-12",
+     "general"
+   ],
+   "label_to_id": {
+     "6-8": 0,
+     "9-10": 1,
+     "11-12": 2,
+     "general": 3
+   },
+   "id_to_label": {
+     "0": "6-8",
+     "1": "9-10",
+     "2": "11-12",
+     "3": "general"
+   },
+   "description": "Age-appropriate readability classification",
+   "status": "placeholder"
+ }
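
Note that this config is marked `"status": "placeholder"` and carries the 4-way age-band labels, while the trained checkpoint is configured with `num_labels: 3`. If the config is used at inference time, the mapping is a plain dictionary lookup; a minimal sketch (the file path and prediction id are illustrative):

```python
import json

# Map a raw class id back to its human-readable label; JSON object keys
# are strings, so the integer id must be converted before lookup.
with open("task_config.json") as f:
    config = json.load(f)

pred_id = 2  # e.g. the argmax over classifier logits
print(config["id_to_label"][str(pred_id)])  # -> "11-12"
```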
training_args.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "model_name": "KothaGPT/bn-en-readability-classifier",
+   "architecture": "BertForSequenceClassification",
+   "base_model": "ai4bharat/indic-bert",
+   "num_labels": 3,
+   "epochs": 4,
+   "batch_size": 16,
+   "learning_rate": 3e-5,
+   "max_seq_length": 256,
+   "optimizer": "AdamW",
+   "dropout": 0.1,
+   "mixed_precision": true,
+   "train_dataset_size": 900000,
+   "eval_dataset_size": 100000,
+   "loss_fn": "cross_entropy",
+   "gradient_accumulation_steps": 2,
+   "scheduler": "linear",
+   "seed": 42,
+   "early_stopping": true,
+   "save_total_limit": 2,
+   "evaluation_strategy": "epoch",
+   "logging_strategy": "steps",
+   "logging_steps": 100
+ }
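
A quick sanity check on the training schedule implied by these values (single-GPU assumption; the file does not record the device count):

```python
# Effective batch size and optimizer-step counts implied by training_args.json
batch_size, grad_accum = 16, 2
train_size, epochs = 900_000, 4

effective_batch = batch_size * grad_accum        # 32 examples per optimizer step
steps_per_epoch = train_size // effective_batch  # 28125 steps per epoch
total_steps = steps_per_epoch * epochs           # 112500 steps over 4 epochs
print(effective_batch, steps_per_epoch, total_steps)
```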