---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---

# 🌍 Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend that offers multiple state-of-the-art ML models, organized by architecture and training dataset.

## ✨ Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows the top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models

## 🚀 Quick Start

### 1. Setup Environment

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Test the Backend

```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.
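For orientation, here is a minimal sketch of how a Gradio front end can be wired to the backend; the real entry point is `app.py`, and the `detect` method and result fields used below are assumptions about the backend's API rather than documented names.

```python
# minimal_app.py (illustrative sketch; the repository's app.py is the real entry point)
import gradio as gr

from backend.language_detector import LanguageDetector  # assumed import path

detector = LanguageDetector(model_key="model-a-dataset-a")

def detect_language(text: str) -> str:
    # `detect` and the result keys are assumptions; adjust to the backend's actual API.
    result = detector.detect(text)
    return f"{result['language']} ({result['confidence']:.2%})"

demo = gr.Interface(
    fn=detect_language,
    inputs=gr.Textbox(lines=4, label="Text to analyze"),
    outputs=gr.Textbox(label="Detected language"),
    title="Language Detection App",
)

if __name__ == "__main__":
    demo.launch(server_port=7860)
```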

## 🧩 Model Architecture

The system is organized around two dimensions:

πŸ—οΈ Model Architectures

- **Model A**: XLM-RoBERTa-based architectures with excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures with efficient, fast processing

### 📊 Training Datasets

- **Dataset A**: Standard multilingual language-detection dataset with broad language coverage
- **Dataset B**: Enhanced/specialized language-detection dataset focused on ultra-high accuracy

### 🤖 Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing
2. **Model B Dataset A** - BERT + Standard Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments
3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements
4. **Model B Dataset B** - BERT + Enhanced Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring the highest accuracy
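The underlying XLM-RoBERTa and BERT checkpoints are typically ordinary Hugging Face text-classification models, so any of the four combinations can also be exercised directly with transformers. A minimal sketch, with a placeholder model id because the actual checkpoint names are resolved in `model_config.py`:

```python
from transformers import pipeline

# Placeholder id: the real checkpoint names live in backend/models/model_config.py
classifier = pipeline(
    "text-classification",
    model="your-org/your-language-detection-checkpoint",
    top_k=5,  # return the five most likely languages, mirroring the app's top-5 output
)

predictions = classifier("Bonjour, comment allez-vous ?")
# Expected shape: a list of {"label": <language code>, "score": <confidence>} dicts,
# ordered from most to least likely.
print(predictions)
```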

πŸ—οΈ Core Components

- **BaseLanguageModel**: Abstract interface that all models must implement
- **ModelRegistry**: Manages model registration and creation with centralized configuration
- **LanguageDetector**: Main orchestrator for language detection
- `model_config.py`: Centralized configuration for all models and language mappings
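As a rough sketch of the `BaseLanguageModel` contract (the method names mirror those in the example under "Adding New Models" below; the exact signatures in `backend/models/base_model.py` may differ):

```python
# backend/models/base_model.py (illustrative sketch, not the verbatim source)
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseLanguageModel(ABC):
    """Abstract interface that every model combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return the detected language and confidence scores for `text`."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model can detect."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (name, accuracy, size, ...)."""
```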

## 🔧 Adding New Models

To add a new model combination, simply:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`

Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model (e.g., load the checkpoint named in the config)

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        pass
```

Then add the new model's configuration to `model_config.py` and register it in `language_detector.py`, as sketched below.
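A hedged sketch of what those two steps could look like; the dictionary layout and the `ModelRegistry.register` call are assumptions based on the component names above, not the project's actual code:

```python
# backend/models/model_config.py (assumed layout)
MODEL_CONFIGS = {
    # ...existing entries...
    "model-c-dataset-a": {
        "display_name": "Model C Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A",
        "checkpoint": "your-org/your-checkpoint",  # placeholder id
    },
}

def get_model_config(model_key: str) -> dict:
    return MODEL_CONFIGS[model_key]

# backend/language_detector.py (assumed registration hook)
# from .models.model_c_dataset_a import ModelCDatasetA
# ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```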

## 🧪 Testing

The project includes comprehensive test suites:

- `test_app.py`: General app functionality tests
- `test_model_a_dataset_a.py`: Tests for XLM-RoBERTa + standard dataset
- `test_model_b_dataset_b.py`: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
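As a flavor of the kind of check these suites run (the `detect` call and result keys are the same assumed API used in this README's other examples):

```python
# illustrative test, not the repository's actual suite
from backend.language_detector import LanguageDetector  # assumed import path

def test_detects_english():
    detector = LanguageDetector(model_key="model-b-dataset-b")
    result = detector.detect("The quick brown fox jumps over the lazy dog.")
    assert result["language"] == "en"   # assumed result key and code format
    assert result["confidence"] > 0.9   # assumed confidence field

if __name__ == "__main__":
    test_detects_english()
    print("ok")
```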

## 🌐 Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages, including major European, Asian, African, and other world languages, based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
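For reference, the 20 Dataset B languages correspond to these ISO 639-1 codes (the codes themselves are standard; whether `model_config.py` stores the mapping exactly like this is an assumption):

```python
# ISO 639-1 codes for the 20 Dataset B languages (mapping layout is illustrative)
DATASET_B_LANGUAGES = {
    "ar": "Arabic",     "bg": "Bulgarian",  "de": "German",     "el": "Greek",
    "en": "English",    "es": "Spanish",    "fr": "French",     "hi": "Hindi",
    "it": "Italian",    "ja": "Japanese",   "nl": "Dutch",      "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian",    "sw": "Swahili",    "th": "Thai",
    "tr": "Turkish",    "ur": "Urdu",       "vi": "Vietnamese", "zh": "Chinese",
}
```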

## 📊 Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---|---|---|---|---|
| Architecture | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| Dataset | Standard | Standard | Enhanced | Enhanced |
| Accuracy | 97.9% | 96.17% | 99.72% | 99.85% 🏆 |
| Model Size | 278M | 178M | 278M | 178M |
| Languages | 100+ | 100+ | 20 (curated) | 20 (curated) |
| Training Loss | N/A | N/A | 0.0176 | 0.0125 |
| Speed | Moderate | Fast | Moderate | Fast |
| Memory Usage | Higher | Lower | Higher | Lower |
| Best For | Balanced performance | Speed & broad coverage | Ultra-high accuracy | Maximum precision |

## 🎯 Model Selection Guide

  • πŸ† Model B Dataset B: Choose for maximum accuracy on 20 core languages (99.85%)
  • πŸ”¬ Model A Dataset B: Choose for ultra-high accuracy on 20 core languages (99.72%)
  • βš–οΈ Model A Dataset A: Choose for balanced performance and comprehensive language coverage (97.9%)
  • ⚑ Model B Dataset A: Choose for fast inference and broad language coverage (96.17%)

## 🔧 Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum precision BERT

# All configurations are centralized in backend/models/model_config.py
```
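Once a detector is constructed, detection is a single call. The `detect` method and the result fields below are the assumed API used throughout this README's examples, shown here to illustrate the top-5 output described under Features:

```python
detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.detect("Dies ist ein kurzer deutscher Satz.")

# Assumed result shape (field names illustrative):
# {
#     "language": "de",
#     "confidence": 0.998,
#     "top_predictions": [
#         {"language": "de", "confidence": 0.998},
#         {"language": "nl", "confidence": 0.001},
#         ...
#     ],
# }
print(result["language"], result["confidence"])
```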

πŸ“ Project Structure

```text
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py          # Centralized configuration
│   │   ├── base_model.py            # Abstract base class
│   │   ├── model_a_dataset_a.py     # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py     # BERT + Standard
│   │   ├── model_a_dataset_b.py     # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py     # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py         # Main orchestrator
├── tests/
├── app.py                           # Gradio interface
└── README.md
```

## 🤝 Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request

πŸ“ License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible