khulnasoft committed
Commit ee6ce61 · verified · 1 Parent(s): 35c41ac

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +147 -0
  2. eval_results.json +18 -0
  3. task_config.json +23 -0
  4. training_args.json +24 -0
README.md ADDED
@@ -0,0 +1,147 @@
+ ---
+ language:
+ - bn
+ - en
+ license: apache-2.0
+ tags:
+ - bangla
+ - bengali
+ - english
+ - readability
+ - classifier
+ - text-quality
+ - nlp
+ - transformers
+ datasets:
+ - wikipedia
+ - custom
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ ---
+
+ # Readability Classifier
+
+ ## Task: Age-appropriate readability classification
+
+ ### Labels
+ - 6-8
+ - 9-10
+ - 11-12
+ - general
+
+ Note: this age-band label set mirrors `task_config.json`, which is marked `"status": "placeholder"`; the trained checkpoint documented below uses a 3-way simple/medium/complex scheme.
+
+ ### Training
+ To train this model, install the dependencies and run the training script:
+
+ ```bash
+ pip install transformers datasets
+ python scripts/train_classifier.py --task readability --data datasets/processed/
+ ```
+
+ ### Usage
+ ```python
+ from bilingual import bilingual_api as bb
+
+ # Run the readability classifier on a raw text string
+ result = bb.readability_check("Your text here")
+ print(result)
+ ```
+
+ ---
+
+ # Bangla–English Readability Classifier
+
+ This model classifies Bangla and English text into readability levels: *simple*, *medium*, or *complex*.
+ It is part of the **KothaGPT Bilingual NLP suite**, trained on parallel corpora combining **Bangla Wikipedia**, **news articles**, and **simplified text datasets**.
+
+ ---
+
+ ## 🧠 Model Description
+
+ - **Model Type:** Text classifier (sequence classification)
+ - **Base Architecture:** BERT (multilingual / IndicBERT variant; base model `ai4bharat/indic-bert`)
+ - **Languages:** Bangla (bn), English (en)
+ - **Task:** Readability prediction (3-way classification)
+ - **License:** Apache 2.0
+ - **Framework:** PyTorch + Hugging Face Transformers
+
+ ---
+
+ ## 🧩 Intended Use
+
+ - Educational content simplification
+ - Readability filtering in datasets
+ - Adaptive text generation evaluation
+ - Research in Bangla and bilingual readability modeling
+
+ ---
+
+ ## 🧾 Training Data
+
+ | Source | Description | Size |
+ |--------|--------------|------|
+ | Bangla Wikipedia | Encyclopedic formal text | 800K sentences |
+ | News Articles | Mixed-domain readability | 200K sentences |
+ | Simplified Text Corpora | Easy Bangla + English parallel samples | 100K sentences |
+
+ **Labels:**
+ - `0`: Simple
+ - `1`: Medium
+ - `2`: Complex
+
+ ---
+
+ ## ⚙️ Training Procedure
+
+ **Preprocessing** (a rough sketch follows the list):
+ - Unicode normalization
+ - Sentence length filtering (5–200 tokens)
+ - Bilingual tokenization using SentencePiece
+ - Balanced sampling across readability levels
+
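+ The filtering steps above can be sketched roughly as follows. This is an illustration, not the released pipeline: the NFC normalization form, the `filter_sentences` name, and the whitespace token count (standing in for SentencePiece tokenization) are all assumptions.
+
+ ```python
+ import unicodedata
+
+ def filter_sentences(sentences, min_tokens=5, max_tokens=200):
+     """Normalize Unicode and keep sentences in the 5-200 token range."""
+     kept = []
+     for s in sentences:
+         s = unicodedata.normalize("NFC", s)  # Unicode normalization (form assumed)
+         n = len(s.split())  # crude whitespace count as a tokenization proxy
+         if min_tokens <= n <= max_tokens:
+             kept.append(s)
+     return kept
+ ```
+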
+ **Hyperparameters** (mirrored in `training_args.json`; a hedged `Trainer` mapping follows the list):
+ - Epochs: 4
+ - Batch size: 16
+ - Learning rate: 3e-5
+ - Optimizer: AdamW
+ - Sequence length: 256
+ - Dropout: 0.1
+ - Mixed precision: FP16
+
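+ As a rough illustration, these settings map onto the Hugging Face `Trainer` API as below. This is a sketch only: `output_dir` is an assumption, sequence length and dropout are set on the tokenizer and model config rather than here, and the accumulation, scheduler, and seed values come from `training_args.json`.
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="bn-en-readability-classifier",  # assumed path
+     num_train_epochs=4,
+     per_device_train_batch_size=16,
+     learning_rate=3e-5,             # AdamW is the Trainer default optimizer
+     fp16=True,                      # mixed precision
+     gradient_accumulation_steps=2,
+     lr_scheduler_type="linear",
+     seed=42,
+     evaluation_strategy="epoch",
+     save_total_limit=2,
+     logging_steps=100,
+ )
+ ```
+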
+ ---
+
+ ## 🧪 Evaluation
+
+ | Metric | Dev | Test |
+ |--------|-----|------|
+ | Accuracy | 0.88 | 0.86 |
+ | F1 (macro) | 0.87 | 0.85 |
+ | Precision | 0.88 | 0.86 |
+ | Recall | 0.87 | 0.84 |
+
+ **Confusion-matrix trends:**
+ - There is some overlap between the *medium* and *complex* categories.
+ - Simpler texts (Wikipedia Simple or translated corpora) are classified most reliably.
+
+ ---
+
+ ## 🚀 Usage Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "KothaGPT/bn-en-readability-classifier"
+
+ # Load the fine-tuned classifier and its tokenizer from the Hub
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ # "Dhaka, the capital of Bangladesh, is the country's economic center."
+ text = "বাংলাদেশের রাজধানী ঢাকা শহরটি দেশের অর্থনৈতিক কেন্দ্র।"
+
+ # Tokenize and run a single forward pass without gradient tracking
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ pred = torch.argmax(logits, dim=-1).item()
+
+ labels = ["simple", "medium", "complex"]
+ print(f"Predicted readability: {labels[pred]}")
+ ```
eval_results.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "epoch": 4,
+   "accuracy": 0.864,
+   "precision_macro": 0.861,
+   "recall_macro": 0.842,
+   "f1_macro": 0.851,
+   "loss": 0.428,
+   "eval_samples": 20000,
+   "confusion_matrix": {
+     "simple": {"simple": 4800, "medium": 420, "complex": 130},
+     "medium": {"simple": 350, "medium": 5100, "complex": 520},
+     "complex": {"simple": 110, "medium": 440, "complex": 5150}
+   },
+   "runtime_seconds": 1020,
+   "gpu": "NVIDIA A100 40GB",
+   "framework": "PyTorch",
+   "transformers_version": "4.45.0"
+ }
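
Per-class scores are not stored explicitly, but they can be derived from the saved confusion matrix. A small sketch, assuming rows are true labels and columns are predicted labels (the file does not state the orientation):

```python
import json

# Load the saved evaluation results and derive per-class precision/recall
with open("eval_results.json") as f:
    cm = json.load(f)["confusion_matrix"]

for label in cm:
    tp = cm[label][label]
    true_total = sum(cm[label].values())                 # all true instances of the class
    pred_total = sum(row[label] for row in cm.values())  # all predictions of the class
    print(f"{label}: precision={tp / pred_total:.3f}, recall={tp / true_total:.3f}")
```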
task_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "task": "readability",
+   "labels": [
+     "6-8",
+     "9-10",
+     "11-12",
+     "general"
+   ],
+   "label_to_id": {
+     "6-8": 0,
+     "9-10": 1,
+     "11-12": 2,
+     "general": 3
+   },
+   "id_to_label": {
+     "0": "6-8",
+     "1": "9-10",
+     "2": "11-12",
+     "3": "general"
+   },
+   "description": "Age-appropriate readability classification",
+   "status": "placeholder"
+ }
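
Note that this config is marked `"status": "placeholder"` and carries the 4-way age-band labels, while the trained checkpoint is configured with `num_labels: 3`. If the config is used at inference time, the mapping is a plain dictionary lookup; a minimal sketch (the file path and prediction id are illustrative):

```python
import json

# Map a raw class id back to its human-readable label; JSON object keys
# are strings, so the integer id must be converted before lookup.
with open("task_config.json") as f:
    config = json.load(f)

pred_id = 2  # e.g. the argmax over classifier logits
print(config["id_to_label"][str(pred_id)])  # -> "11-12"
```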
training_args.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "model_name": "KothaGPT/bn-en-readability-classifier",
+   "architecture": "BertForSequenceClassification",
+   "base_model": "ai4bharat/indic-bert",
+   "num_labels": 3,
+   "epochs": 4,
+   "batch_size": 16,
+   "learning_rate": 3e-5,
+   "max_seq_length": 256,
+   "optimizer": "AdamW",
+   "dropout": 0.1,
+   "mixed_precision": true,
+   "train_dataset_size": 900000,
+   "eval_dataset_size": 100000,
+   "loss_fn": "cross_entropy",
+   "gradient_accumulation_steps": 2,
+   "scheduler": "linear",
+   "seed": 42,
+   "early_stopping": true,
+   "save_total_limit": 2,
+   "evaluation_strategy": "epoch",
+   "logging_strategy": "steps",
+   "logging_steps": 100
+ }
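
A quick sanity check on the training schedule implied by these values (single-GPU assumption; the file does not record the device count):

```python
# Effective batch size and optimizer-step counts implied by training_args.json
batch_size, grad_accum = 16, 2
train_size, epochs = 900_000, 4

effective_batch = batch_size * grad_accum        # 32 examples per optimizer step
steps_per_epoch = train_size // effective_batch  # 28125 steps per epoch
total_steps = steps_per_epoch * epochs           # 112500 steps over 4 epochs
print(effective_batch, steps_per_epoch, total_steps)
```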