---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---
# 🌍 Language Detection App
A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.
## ✨ Features
- Clean Gradio Interface: Simple, intuitive web interface for language detection
- Multiple Model Architectures: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- Multiple Training Datasets: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- Centralized Configuration: All model configurations and settings in one place
- Modular Backend: Easy-to-extend architecture for plugging in your own ML models
- Real-time Detection: Instant language detection with confidence scores
- Multiple Predictions: Shows the top 5 language predictions with confidence levels (see the example below)
- 100+ Languages: Support for major world languages (varies by model)
- Example Texts: Pre-loaded examples in various languages for testing
- Model Switching: Seamlessly switch between different models
- Extensible: Abstract base class for implementing custom models
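
For a sense of what a detection result looks like, here is a sketch of a plausible top-5 result shape. The field names are illustrative assumptions, not the backend's exact schema:

```python
# Illustrative result shape only; the actual backend schema may differ.
result = {
    "language": "fr",           # top prediction (ISO 639-1 code)
    "confidence": 0.9931,       # confidence of the top prediction
    "top_predictions": [        # top 5 candidates, highest first
        {"language": "fr", "confidence": 0.9931},
        {"language": "it", "confidence": 0.0031},
        {"language": "es", "confidence": 0.0019},
        {"language": "pt", "confidence": 0.0011},
        {"language": "ro", "confidence": 0.0008},
    ],
}
```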
## 🚀 Quick Start
### 1. Setup Environment
```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### 2. Test the Backend
```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```
### 3. Launch the App
```bash
# Start the Gradio app
python app.py
```
The app will be available at `http://localhost:7860`.
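
For orientation, a minimal version of the Gradio wiring in `app.py` might look like the sketch below. The `LanguageDetector` import path follows the project structure shown later; the `predict` method and result fields are assumptions, not verbatim code:

```python
# Minimal sketch of the Gradio wiring; assumes LanguageDetector exposes
# a predict() method returning language/confidence pairs.
import gradio as gr

from backend.language_detector import LanguageDetector  # path per project structure

detector = LanguageDetector(model_key="model-a-dataset-a")

def detect(text: str) -> dict:
    # Gradio's Label component accepts a {label: confidence} mapping.
    result = detector.predict(text)  # assumed method name
    return {p["language"]: p["confidence"] for p in result["top_predictions"]}

demo = gr.Interface(
    fn=detect,
    inputs=gr.Textbox(lines=4, label="Text"),
    outputs=gr.Label(num_top_classes=5, label="Detected Language"),
    title="Language Detection App",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```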
## 🧩 Model Architecture
The system is organized around two dimensions:
### 🏗️ Model Architectures
- Model A: XLM-RoBERTa-based architecture - Excellent cross-lingual transfer capabilities
- Model B: BERT-based architecture - Efficient and fast processing
### 📊 Training Datasets
- Dataset A: Standard multilingual language detection dataset - Broad language coverage
- Dataset B: Enhanced/specialized language detection dataset - Ultra-high accuracy focus
## 🤖 Available Model Combinations
### Model A Dataset A - XLM-RoBERTa + Standard Dataset ✅
- Architecture: XLM-RoBERTa (Model A)
- Training: Dataset A (standard multilingual)
- Accuracy: 97.9%
- Size: 278M parameters
- Languages: 100+ languages
- Strengths: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
- Use Cases: General-purpose language detection, multilingual content processing
### Model B Dataset A - BERT + Standard Dataset ✅
- Architecture: BERT (Model B)
- Training: Dataset A (standard multilingual)
- Accuracy: 96.17%
- Size: 178M parameters
- Languages: 100+ languages
- Strengths: Fast inference, broad language support, efficient processing
- Use Cases: High-throughput detection, real-time applications, resource-constrained environments
### Model A Dataset B - XLM-RoBERTa + Enhanced Dataset ✅
- Architecture: XLM-RoBERTa (Model A)
- Training: Dataset B (enhanced/specialized)
- Accuracy: 99.72%
- Size: 278M parameters
- Training Loss: 0.0176
- Languages: 20 carefully selected languages
- Strengths: Exceptional accuracy, focused language support, state-of-the-art results
- Use Cases: Research applications, high-precision detection, critical accuracy requirements
### Model B Dataset B - BERT + Enhanced Dataset ✅
- Architecture: BERT (Model B)
- Training: Dataset B (enhanced/specialized)
- Accuracy: 99.85%
- Size: 178M parameters
- Training Loss: 0.0125
- Languages: 20 carefully selected languages
- Strengths: Highest accuracy, ultra-low training loss, precision-optimized
- Use Cases: Maximum precision applications, research requiring highest accuracy
## 🏗️ Core Components
- `BaseLanguageModel`: Abstract interface that all models must implement (sketched below)
- `ModelRegistry`: Manages model registration and creation with centralized configuration
- `LanguageDetector`: Main orchestrator for language detection
- `model_config.py`: Centralized configuration for all models and language mappings
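
The method names below come from the "Adding New Models" example; everything else is a minimal sketch of what the abstract interface plausibly looks like, not the verbatim source:

```python
# Minimal sketch of the abstract interface; signatures follow the example
# in "Adding New Models", but the actual source may differ in detail.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseLanguageModel(ABC):
    """Contract that every model combination must implement."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return the detected language(s) with confidence scores."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the ISO codes of the languages this model supports."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (architecture, dataset, accuracy, ...)."""
```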
## 🔧 Adding New Models
To add a new model combination, simply:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`
Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model here (load weights, tokenizer, etc.)

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement your prediction logic here
        pass

    def get_supported_languages(self) -> List[str]:
        # Return the supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from the config
        return self.config
```
Then add configuration in `model_config.py` and register it in `language_detector.py`.
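
The exact shape of that configuration entry and the registry API are not shown in this snapshot; as a purely hypothetical sketch, the two steps might look like:

```python
# Hypothetical sketch only; the real schema and registry API may differ.

# In backend/models/model_config.py: describe the new combination.
MODEL_CONFIGS = {
    "model-c-dataset-a": {
        "display_name": "Model C + Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A",
        "languages": ["en", "fr", "de"],  # placeholder values
    },
}

# In backend/language_detector.py: make the registry aware of the class.
from backend.models.model_c_dataset_a import ModelCDatasetA
from backend.models import ModelRegistry  # assumed import location

ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```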
## 🧪 Testing
The project includes comprehensive test suites:
- `test_app.py`: General app functionality tests
- `test_model_a_dataset_a.py`: Tests for XLM-RoBERTa + standard dataset
- `test_model_b_dataset_b.py`: Tests for BERT + enhanced dataset (highest accuracy)
- Model comparison tests: Automated testing across all model combinations
- Model switching tests: Verify seamless model switching
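
As an illustration of what a model-switching test can check, here is a sketch under the assumed `LanguageDetector` API (see the Configuration section below); it is not a copy of the actual suite:

```python
# Sketch of a model-switching test; assumes LanguageDetector exposes
# a predict() method returning a result with a "language" field.
from backend.language_detector import LanguageDetector

def test_model_switching():
    for key in ["model-a-dataset-a", "model-b-dataset-b"]:
        detector = LanguageDetector(model_key=key)
        result = detector.predict("Bonjour tout le monde")  # assumed method
        assert result["language"] == "fr"

if __name__ == "__main__":
    test_model_switching()
    print("Model switching OK")
```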
## 🌍 Supported Languages
The models support different language sets based on their training:
- Model A/B + Dataset A: 100+ languages including major European, Asian, African, and other world languages based on the CC-100 dataset
- Model A/B + Dataset B: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
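
Programmatically, each model reports its own language set via `get_supported_languages()` (the base-class method sketched earlier). Assuming `LanguageDetector` forwards the call to the active model, which this snapshot does not confirm, usage might look like:

```python
# Sketch; assumes the detector forwards get_supported_languages()
# from the active model.
detector = LanguageDetector(model_key="model-b-dataset-b")
print(detector.get_supported_languages())
# e.g. ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'it', 'ja',
#       'nl', 'pl', 'pt', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh']
```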
## 📊 Model Comparison
| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---|---|---|---|---|
| Architecture | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| Dataset | Standard | Standard | Enhanced | Enhanced |
| Accuracy | 97.9% | 96.17% | 99.72% | 99.85% 🏆 |
| Model Size | 278M | 178M | 278M | 178M |
| Languages | 100+ | 100+ | 20 (curated) | 20 (curated) |
| Training Loss | N/A | N/A | 0.0176 | 0.0125 |
| Speed | Moderate | Fast | Moderate | Fast |
| Memory Usage | Higher | Lower | Higher | Lower |
| Best For | Balanced performance | Speed & broad coverage | Ultra-high accuracy | Maximum precision |
## 🎯 Model Selection Guide
- 🏆 Model B Dataset B: Choose for maximum accuracy on 20 core languages (99.85%)
- 🔬 Model A Dataset B: Choose for ultra-high accuracy on 20 core languages (99.72%)
- ⚖️ Model A Dataset A: Choose for balanced performance and comprehensive language coverage (97.9%)
- ⚡ Model B Dataset A: Choose for fast inference and broad language coverage (96.17%)
## 🔧 Configuration
You can configure models using the centralized configuration system:
```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum precision BERT

# All configurations are centralized in backend/models/model_config.py
```
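
Once constructed, a detector can be used and swapped roughly like this (a sketch; the `predict` method and result fields are assumptions consistent with the examples above):

```python
# Hypothetical usage; method and field names are assumptions.
detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.predict("Das ist ein kurzer deutscher Satz.")
print(result["language"], result["confidence"])  # e.g. de 0.99

# Switching models is just a matter of constructing with another key.
broad_detector = LanguageDetector(model_key="model-a-dataset-a")  # 100+ languages
```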
## 📁 Project Structure
```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py        # Centralized configuration
│   │   ├── base_model.py          # Abstract base class
│   │   ├── model_a_dataset_a.py   # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py   # BERT + Standard
│   │   ├── model_a_dataset_b.py   # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py   # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py       # Main orchestrator
├── tests/
├── app.py                         # Gradio interface
└── README.md
```
## 🤝 Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request
## 📄 License
This project is open source and available under the MIT License.
## 🙏 Acknowledgments
- Hugging Face for the transformers library and model hosting platform
- Model providers for the fine-tuned language detection models used in this project
- Gradio for the excellent web interface framework
- Open source community for the foundational technologies that make this project possible