---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---

# 🌍 Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend that offers multiple state-of-the-art ML models, organized by architecture and training dataset.

## ✨ Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows the top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models

## 🚀 Quick Start

### 1. Setup Environment

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Test the Backend

```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.
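For orientation, here is a minimal sketch of how a Gradio front end can be wired to the backend; the real entry point is `app.py`, and the `detect` method and result fields used below are assumptions about the backend's API rather than documented names.

```python
# minimal_app.py (illustrative sketch; the repository's app.py is the real entry point)
import gradio as gr

from backend.language_detector import LanguageDetector  # assumed import path

detector = LanguageDetector(model_key="model-a-dataset-a")

def detect_language(text: str) -> str:
    # `detect` and the result keys are assumptions; adjust to the backend's actual API.
    result = detector.detect(text)
    return f"{result['language']} ({result['confidence']:.2%})"

demo = gr.Interface(
    fn=detect_language,
    inputs=gr.Textbox(lines=4, label="Text to analyze"),
    outputs=gr.Textbox(label="Detected language"),
    title="Language Detection App",
)

if __name__ == "__main__":
    demo.launch(server_port=7860)
```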

## 🧩 Model Architecture

The system is organized around two dimensions:

πŸ—οΈ Model Architectures

- **Model A**: XLM-RoBERTa-based architectures with excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures with efficient, fast processing

### 📊 Training Datasets

- **Dataset A**: Standard multilingual language-detection dataset with broad language coverage
- **Dataset B**: Enhanced/specialized language-detection dataset focused on ultra-high accuracy

### 🤖 Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing
2. **Model B Dataset A** - BERT + Standard Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments
3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements
4. **Model B Dataset B** - BERT + Enhanced Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring the highest accuracy
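The underlying XLM-RoBERTa and BERT checkpoints are typically ordinary Hugging Face text-classification models, so any of the four combinations can also be exercised directly with transformers. A minimal sketch, with a placeholder model id because the actual checkpoint names are resolved in `model_config.py`:

```python
from transformers import pipeline

# Placeholder id: the real checkpoint names live in backend/models/model_config.py
classifier = pipeline(
    "text-classification",
    model="your-org/your-language-detection-checkpoint",
    top_k=5,  # return the five most likely languages, mirroring the app's top-5 output
)

predictions = classifier("Bonjour, comment allez-vous ?")
# Expected shape: a list of {"label": <language code>, "score": <confidence>} dicts,
# ordered from most to least likely.
print(predictions)
```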

πŸ—οΈ Core Components

- **BaseLanguageModel**: Abstract interface that all models must implement
- **ModelRegistry**: Manages model registration and creation with centralized configuration
- **LanguageDetector**: Main orchestrator for language detection
- `model_config.py`: Centralized configuration for all models and language mappings
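As a rough sketch of the `BaseLanguageModel` contract (the method names mirror those in the example under "Adding New Models" below; the exact signatures in `backend/models/base_model.py` may differ):

```python
# backend/models/base_model.py (illustrative sketch, not the verbatim source)
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseLanguageModel(ABC):
    """Abstract interface that every model combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return the detected language and confidence scores for `text`."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model can detect."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (name, accuracy, size, ...)."""
```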

## 🔧 Adding New Models

To add a new model combination, simply:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`

Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model (e.g., load the checkpoint named in the config)

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        pass
```

Then add the new model's configuration to `model_config.py` and register it in `language_detector.py`, as sketched below.
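A hedged sketch of what those two steps could look like; the dictionary layout and the `ModelRegistry.register` call are assumptions based on the component names above, not the project's actual code:

```python
# backend/models/model_config.py (assumed layout)
MODEL_CONFIGS = {
    # ...existing entries...
    "model-c-dataset-a": {
        "display_name": "Model C Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A",
        "checkpoint": "your-org/your-checkpoint",  # placeholder id
    },
}

def get_model_config(model_key: str) -> dict:
    return MODEL_CONFIGS[model_key]

# backend/language_detector.py (assumed registration hook)
# from .models.model_c_dataset_a import ModelCDatasetA
# ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```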

## 🧪 Testing

The project includes comprehensive test suites:

- `test_app.py`: General app functionality tests
- `test_model_a_dataset_a.py`: Tests for XLM-RoBERTa + standard dataset
- `test_model_b_dataset_b.py`: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
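As a flavor of the kind of check these suites run (the `detect` call and result keys are the same assumed API used in this README's other examples):

```python
# illustrative test, not the repository's actual suite
from backend.language_detector import LanguageDetector  # assumed import path

def test_detects_english():
    detector = LanguageDetector(model_key="model-b-dataset-b")
    result = detector.detect("The quick brown fox jumps over the lazy dog.")
    assert result["language"] == "en"   # assumed result key and code format
    assert result["confidence"] > 0.9   # assumed confidence field

if __name__ == "__main__":
    test_detects_english()
    print("ok")
```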

## 🌐 Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages, including major European, Asian, African, and other world languages, based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
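For reference, the 20 Dataset B languages correspond to these ISO 639-1 codes (the codes themselves are standard; whether `model_config.py` stores the mapping exactly like this is an assumption):

```python
# ISO 639-1 codes for the 20 Dataset B languages (mapping layout is illustrative)
DATASET_B_LANGUAGES = {
    "ar": "Arabic",     "bg": "Bulgarian",  "de": "German",     "el": "Greek",
    "en": "English",    "es": "Spanish",    "fr": "French",     "hi": "Hindi",
    "it": "Italian",    "ja": "Japanese",   "nl": "Dutch",      "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian",    "sw": "Swahili",    "th": "Thai",
    "tr": "Turkish",    "ur": "Urdu",       "vi": "Vietnamese", "zh": "Chinese",
}
```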

## 📊 Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---|---|---|---|---|
| Architecture | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| Dataset | Standard | Standard | Enhanced | Enhanced |
| Accuracy | 97.9% | 96.17% | 99.72% | 99.85% 🏆 |
| Model Size | 278M | 178M | 278M | 178M |
| Languages | 100+ | 100+ | 20 (curated) | 20 (curated) |
| Training Loss | N/A | N/A | 0.0176 | 0.0125 |
| Speed | Moderate | Fast | Moderate | Fast |
| Memory Usage | Higher | Lower | Higher | Lower |
| Best For | Balanced performance | Speed & broad coverage | Ultra-high accuracy | Maximum precision |

## 🎯 Model Selection Guide

  • πŸ† Model B Dataset B: Choose for maximum accuracy on 20 core languages (99.85%)
  • πŸ”¬ Model A Dataset B: Choose for ultra-high accuracy on 20 core languages (99.72%)
  • βš–οΈ Model A Dataset A: Choose for balanced performance and comprehensive language coverage (97.9%)
  • ⚑ Model B Dataset A: Choose for fast inference and broad language coverage (96.17%)

## 🔧 Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum precision BERT

# All configurations are centralized in backend/models/model_config.py
```
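Once a detector is constructed, detection is a single call. The `detect` method and the result fields below are the assumed API used throughout this README's examples, shown here to illustrate the top-5 output described under Features:

```python
detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.detect("Dies ist ein kurzer deutscher Satz.")

# Assumed result shape (field names illustrative):
# {
#     "language": "de",
#     "confidence": 0.998,
#     "top_predictions": [
#         {"language": "de", "confidence": 0.998},
#         {"language": "nl", "confidence": 0.001},
#         ...
#     ],
# }
print(result["language"], result["confidence"])
```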

πŸ“ Project Structure

```text
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py          # Centralized configuration
│   │   ├── base_model.py            # Abstract base class
│   │   ├── model_a_dataset_a.py     # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py     # BERT + Standard
│   │   ├── model_a_dataset_b.py     # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py     # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py         # Main orchestrator
├── tests/
├── app.py                           # Gradio interface
└── README.md
```

## 🤝 Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request

πŸ“ License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible