Upload folder using huggingface_hub


Files changed (4)

README.md +0 -121
config.json +35 -0
task_config.json +1 -1
training_args.json +2 -2

README.md CHANGED

@@ -24,124 +24,3 @@ from bilingual import bilingual_api as bb
 result = bb.readability_check("Your text here")
 print(result)
 ```
----
-language:
-- bn
-- en
-license: apache-2.0
-tags:
-- bangla
-- bengali
-- english
-- readability
-- classifier
-- text-quality
-- nlp
-- transformers
-datasets:
-- wikipedia
-- custom
-metrics:
-- accuracy
-- f1
-- precision
-- recall
----
-# Bangla–English Readability Classifier
-This model classifies Bangla and English text into readability levels — *simple*, *medium*, or *complex*.
-It is part of the **KothaGPT Bilingual NLP suite**, trained on parallel corpora combining **Bangla Wikipedia**, **news articles**, and **simplified text datasets**.
----
-## 🧠 Model Description
-- **Model Type:** Text classifier (sequence classification)
-- **Base Architecture:** BERT (Multilingual / IndicBERT variant)
-- **Languages:** Bangla (bn), English (en)
-- **Task:** Readability prediction (3-way classification)
-- **License:** Apache 2.0
-- **Framework:** PyTorch + Hugging Face Transformers
----
-## 🧩 Intended Use
-- Educational content simplification
-- Readability filtering in datasets
-- Adaptive text generation evaluation
-- Research in Bangla and bilingual readability modeling
----
-## 🧾 Training Data
-| Source | Description | Size |
-|--------|--------------|------|
-| Bangla Wikipedia | Encyclopedic formal text | 800K sentences |
-| News Articles | Mixed domain readability | 200K sentences |
-| Simplified Text Corpora | Easy Bangla + English parallel samples | 100K sentences |
-**Labels:**
-- `0`: Simple
-- `1`: Medium
-- `2`: Complex
----
-## ⚙️ Training Procedure
-**Preprocessing:**
-- Unicode normalization
-- Sentence length filtering (5–200 tokens)
-- Bilingual tokenization using SentencePiece
-- Balanced sampling across readability levels
-**Hyperparameters:**
-- Epochs: 4
-- Batch size: 16
-- Learning rate: 3e-5
-- Optimizer: AdamW
-- Sequence length: 256
-- Dropout: 0.1
-- Mixed precision: FP16
----
-## 🧪 Evaluation
-| Metric | Dev | Test |
-|--------|-----|------|
-| Accuracy | 0.88 | 0.86 |
-| F1 (macro) | 0.87 | 0.85 |
-| Precision | 0.88 | 0.86 |
-| Recall | 0.87 | 0.84 |
-**Confusion matrix trends:**
-- Some overlap between *medium* and *complex* categories.
-- Simpler texts (Wikipedia Simple or translated corpora) perform best.
----
-## 🚀 Usage Example
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-model_id = "KothaGPT/bn-en-readability-classifier"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForSequenceClassification.from_pretrained(model_id)
-text = "বাংলাদেশের রাজধানী ঢাকা শহরটি দেশের অর্থনৈতিক কেন্দ্র।"
-inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
-with torch.no_grad():
-    logits = model(**inputs).logits
-    pred = torch.argmax(logits, dim=-1).item()
-labels = ["simple", "medium", "complex"]
-print(f"Predicted readability: {labels[pred]}")

 result = bb.readability_check("Your text here")
 print(result)
 ```

config.json ADDED

@@ -0,0 +1,35 @@

+{
+    "model_type": "bert",
+    "architectures": [
+        "BertForSequenceClassification"
+    ],
+    "task_type": "text-classification",
+    "num_labels": 4,
+    "label2id": {
+        "6-8": 0,
+        "9-10": 1,
+        "11-12": 2,
+        "general": 3
+    },
+    "id2label": {
+        "0": "6-8",
+        "1": "9-10",
+        "2": "11-12",
+        "3": "general"
+    },
+    "hidden_size": 768,
+    "num_attention_heads": 12,
+    "num_hidden_layers": 12,
+    "intermediate_size": 3072,
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.1,
+    "attention_probs_dropout_prob": 0.1,
+    "max_position_embeddings": 512,
+    "vocab_size": 30522,
+    "type_vocab_size": 2,
+    "initializer_range": 0.02,
+    "layer_norm_eps": 1e-12,
+    "pad_token_id": 0,
+    "problem_type": "single_label_classification",
+    "transformers_version": "4.57.6"
+}
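The added `config.json` defines four grade-band labels, whereas the usage example removed from README.md indexed a hard-coded three-item list (`["simple", "medium", "complex"]`). A minimal sketch of reading the label from `id2label` instead, using only the standard library. The inline JSON copies the relevant fields from this commit's `config.json`; `label_for` is a hypothetical helper name and is not part of the repository:

```python
import json

# Relevant fields copied from the config.json added in this commit.
CONFIG_JSON = """
{
    "num_labels": 4,
    "label2id": {"6-8": 0, "9-10": 1, "11-12": 2, "general": 3},
    "id2label": {"0": "6-8", "1": "9-10", "2": "11-12", "3": "general"}
}
"""

config = json.loads(CONFIG_JSON)


def label_for(pred_index: int, cfg: dict) -> str:
    """Map a predicted class index to its label (hypothetical helper)."""
    # JSON object keys are always strings, so the integer index is converted.
    return cfg["id2label"][str(pred_index)]


print(label_for(3, config))  # general
```

Looking labels up in the config keeps calling code correct if the label set changes again, at the cost of one string conversion per lookup.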

task_config.json CHANGED

@@ -20,4 +20,4 @@
   },
   "description": "Age-appropriate readability classification",
   "status": "placeholder"
-}
+}

training_args.json CHANGED

@@ -2,7 +2,7 @@
   "model_name": "KothaGPT/bn-en-readability-classifier",
   "architecture": "BertForSequenceClassification",
   "base_model": "ai4bharat/indic-bert",
-  "num_labels": 3,
+  "num_labels": 4,
   "epochs": 4,
   "batch_size": 16,
   "learning_rate": 3e-5,
@@ -21,4 +21,4 @@
   "evaluation_strategy": "epoch",
   "logging_strategy": "steps",
   "logging_steps": 100
-}
+}
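With this commit `num_labels` is stated in both `config.json` and `training_args.json`, so a small check can confirm the two files agree with each other and with the label maps. A sketch, assuming both files sit in the same directory; `check_num_labels` is an illustrative name, and the demo writes stand-in files to a temporary directory instead of reading the real repository:

```python
import json
import tempfile
from pathlib import Path


def check_num_labels(repo_dir: Path) -> int:
    """Return num_labels if config.json and training_args.json agree."""
    config = json.loads((repo_dir / "config.json").read_text(encoding="utf-8"))
    args = json.loads((repo_dir / "training_args.json").read_text(encoding="utf-8"))

    n = config["num_labels"]
    if args["num_labels"] != n:
        raise ValueError(
            f"training_args.json has num_labels={args['num_labels']}, "
            f"config.json has num_labels={n}"
        )
    # Both label maps should have exactly one entry per class.
    if len(config["id2label"]) != n or len(config["label2id"]) != n:
        raise ValueError("label maps do not match num_labels")
    return n


# Demo with stand-in files that mirror the values in this commit.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    labels = ["6-8", "9-10", "11-12", "general"]
    stand_in_config = {
        "num_labels": 4,
        "label2id": {name: i for i, name in enumerate(labels)},
        "id2label": {str(i): name for i, name in enumerate(labels)},
    }
    (repo / "config.json").write_text(json.dumps(stand_in_config), encoding="utf-8")
    (repo / "training_args.json").write_text(
        json.dumps({"num_labels": 4}), encoding="utf-8"
    )
    print(check_num_labels(repo))  # 4
```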