khulnasoft committed on
Commit da2cb13 · verified · 1 Parent(s): ee6ce61

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +0 -121
  2. config.json +35 -0
  3. task_config.json +1 -1
  4. training_args.json +2 -2
README.md CHANGED
@@ -24,124 +24,3 @@ from bilingual import bilingual_api as bb
  result = bb.readability_check("Your text here")
  print(result)
  ```
- 
- ---
- language:
- - bn
- - en
- license: apache-2.0
- tags:
- - bangla
- - bengali
- - english
- - readability
- - classifier
- - text-quality
- - nlp
- - transformers
- datasets:
- - wikipedia
- - custom
- metrics:
- - accuracy
- - f1
- - precision
- - recall
- ---
- 
- # Bangla–English Readability Classifier
- 
- This model classifies Bangla and English text into readability levels — *simple*, *medium*, or *complex*.
- It is part of the **KothaGPT Bilingual NLP suite**, trained on parallel corpora combining **Bangla Wikipedia**, **news articles**, and **simplified text datasets**.
- 
- ---
- 
- ## 🧠 Model Description
- 
- - **Model Type:** Text classifier (sequence classification)
- - **Base Architecture:** BERT (Multilingual / IndicBERT variant)
- - **Languages:** Bangla (bn), English (en)
- - **Task:** Readability prediction (3-way classification)
- - **License:** Apache 2.0
- - **Framework:** PyTorch + Hugging Face Transformers
- 
- ---
- 
- ## 🧩 Intended Use
- 
- - Educational content simplification
- - Readability filtering in datasets
- - Adaptive text generation evaluation
- - Research in Bangla and bilingual readability modeling
- 
- ---
- 
- ## 🧾 Training Data
- 
- | Source | Description | Size |
- |--------|--------------|------|
- | Bangla Wikipedia | Encyclopedic formal text | 800K sentences |
- | News Articles | Mixed domain readability | 200K sentences |
- | Simplified Text Corpora | Easy Bangla + English parallel samples | 100K sentences |
- 
- **Labels:**
- - `0`: Simple
- - `1`: Medium
- - `2`: Complex
- 
- ---
- 
- ## ⚙️ Training Procedure
- 
- **Preprocessing:**
- - Unicode normalization
- - Sentence length filtering (5–200 tokens)
- - Bilingual tokenization using SentencePiece
- - Balanced sampling across readability levels
- 
- **Hyperparameters:**
- - Epochs: 4
- - Batch size: 16
- - Learning rate: 3e-5
- - Optimizer: AdamW
- - Sequence length: 256
- - Dropout: 0.1
- - Mixed precision: FP16
- 
- ---
- 
- ## 🧪 Evaluation
- 
- | Metric | Dev | Test |
- |--------|-----|------|
- | Accuracy | 0.88 | 0.86 |
- | F1 (macro) | 0.87 | 0.85 |
- | Precision | 0.88 | 0.86 |
- | Recall | 0.87 | 0.84 |
- 
- **Confusion matrix trends:**
- - Some overlap between *medium* and *complex* categories.
- - Simpler texts (Wikipedia Simple or translated corpora) perform best.
- 
- ---
- 
- ## 🚀 Usage Example
- 
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
- 
- model_id = "KothaGPT/bn-en-readability-classifier"
- 
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
- 
- text = "বাংলাদেশের রাজধানী ঢাকা শহরটি দেশের অর্থনৈতিক কেন্দ্র।"
- 
- inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
- with torch.no_grad():
-     logits = model(**inputs).logits
- pred = torch.argmax(logits, dim=-1).item()
- 
- labels = ["simple", "medium", "complex"]
- print(f"Predicted readability: {labels[pred]}")
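The removed card's Training Procedure section lists Unicode normalization, sentence-length filtering (5–200 tokens), SentencePiece tokenization, and balanced sampling. A minimal sketch of the first two steps, assuming whitespace token counts as a stand-in for the SentencePiece model, which is not part of this commit:

```python
import unicodedata

def preprocess(sentences, min_tokens=5, max_tokens=200):
    """Normalize and length-filter sentences, following the removed
    README's preprocessing notes. Whitespace splitting is used here as
    an illustrative stand-in for the SentencePiece tokenizer."""
    kept = []
    for text in sentences:
        text = unicodedata.normalize("NFC", text)  # Unicode normalization
        n_tokens = len(text.split())
        if min_tokens <= n_tokens <= max_tokens:   # 5-200 token window
            kept.append(text)
    return kept
```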
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "model_type": "bert",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "task_type": "text-classification",
+   "num_labels": 4,
+   "label2id": {
+     "6-8": 0,
+     "9-10": 1,
+     "11-12": 2,
+     "general": 3
+   },
+   "id2label": {
+     "0": "6-8",
+     "1": "9-10",
+     "2": "11-12",
+     "3": "general"
+   },
+   "hidden_size": 768,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "intermediate_size": 3072,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "attention_probs_dropout_prob": 0.1,
+   "max_position_embeddings": 512,
+   "vocab_size": 30522,
+   "type_vocab_size": 2,
+   "initializer_range": 0.02,
+   "layer_norm_eps": 1e-12,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "transformers_version": "4.57.6"
+ }
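With config.json now defining four labels, inference code can read the label map from the config instead of hard-coding the three-name list used in the removed README example. A minimal sketch, assuming the checkpoint is published under the model_name given in training_args.json:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "KothaGPT/bn-en-readability-classifier"

# Inspect the four readability bands declared in config.json above.
config = AutoConfig.from_pretrained(model_id)
print(config.id2label)  # {0: "6-8", 1: "9-10", 2: "11-12", 3: "general"}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Dhaka, the capital of Bangladesh, is the country's economic centre."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()

# Map the predicted index back through the config, not a hard-coded list.
print(f"Predicted readability band: {config.id2label[pred]}")
```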
task_config.json CHANGED
@@ -20,4 +20,4 @@
  },
  "description": "Age-appropriate readability classification",
  "status": "placeholder"
- }
+ }
training_args.json CHANGED
@@ -2,7 +2,7 @@
  "model_name": "KothaGPT/bn-en-readability-classifier",
  "architecture": "BertForSequenceClassification",
  "base_model": "ai4bharat/indic-bert",
- "num_labels": 3,
+ "num_labels": 4,
  "epochs": 4,
  "batch_size": 16,
  "learning_rate": 3e-5,
@@ -21,4 +21,4 @@
  "evaluation_strategy": "epoch",
  "logging_strategy": "steps",
  "logging_steps": 100
- }
+ }
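Only part of this file maps onto transformers.TrainingArguments: model_name, architecture, base_model, and num_labels describe the model rather than the trainer. A sketch of how the trainer-relevant keys might be consumed, assuming a recent transformers release (where evaluation_strategy was renamed eval_strategy); the output_dir value is hypothetical:

```python
import json
from transformers import TrainingArguments

with open("training_args.json") as f:
    raw = json.load(f)

# Keys like model_name / base_model / num_labels configure the model,
# so only the schedule-related fields are passed to the trainer.
args = TrainingArguments(
    output_dir="bn-en-readability-classifier",   # hypothetical; not in the JSON
    num_train_epochs=raw["epochs"],
    per_device_train_batch_size=raw["batch_size"],
    learning_rate=raw["learning_rate"],
    eval_strategy=raw["evaluation_strategy"],    # "evaluation_strategy" in older releases
    logging_strategy=raw["logging_strategy"],
    logging_steps=raw["logging_steps"],
)
```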