--- license: apache-2.0 datasets: - papluca/language-identification language: - en - de - fr - es metrics: - precision - recall - f1 - accuracy pipeline_tag: text-classification --- # German, English, French and Spanish Language Detector The GEFS-language-detector model outperformed by achieving an impressive F1 score close to 100%. This result significantly exceeds typical benchmarks and underscores the model's accuracy and reliability in identifying languages. This is a fined tuned model by using the dataset of papluca [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) and the base model [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) . ## Predicted output: Model will return the language detection in the language codes like: ``` - de as German - en as English - fr as French - es as Spanish ``` ## Supported languages Currently this model support 4 languages but in future more languages will be added. Following languages supported by the model: - German (de) - English (en) - French (fr) - Spanish (es) # Use a pipeline as a high-level helper ```python from transformers import pipeline text=["Mir gefällt die Art und Weise, Sprachen zu erkennen", "I like the way to detect languages", "Me gusta la forma de detectar idiomas", "J'aime la façon de détecter les langues"] pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector") lang_detect=pipe(text, top_k=1) print("The detected language is", lang_detect) ``` # Load model directly ```python from transformers import AutoTokenizer, AutoModelForSequenceClassification tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector") model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector") ``` ## Model Training Epoch Training Loss Validation Loss 1 0.002600 0.000148 2 0.001000 0.000015 3 0.000000 0.000011 4 0.001800 0.000009 5 0.002700 0.000016 6 0.001600 0.000012 7 0.001300 0.000009 8 0.001200 0.000008 9 0.000900 0.000007 10 0.000900 0.000007 ## Testing Results ``` Language Precision Recall F1 Accuracy de 0.9997 0.9998 0.9998 0.9999 en 1.0000 1.0000 1.0000 1.0000 fr 0.9995 0.9996 0.9996 0.9996 es 0.9994 0.9996 0.9995 0.9996 ``` ## About Author **Name**: Muhammad Imran Zaman **Company**: [Theum AG](https://theum.com/en/index.htm?t=) **Role**: Lead Machine Learning Engineer **Professional Links**: - Kaggle: [Profile](https://www.kaggle.com/muhammadimran112233) - LinkedIn: [Profile](linkedin.com/in/muhammad-imran-zaman) - Google Scholar: [Profile](https://scholar.google.com/citations?user=ulVFpy8AAAAJ&hl=en) - YouTube: [Channel](https://www.youtube.com/@consolioo) - GitHub: [Channel](https://github.com/Imran-ml)