---
license: apache-2.0
datasets:
- papluca/language-identification
language:
- en
- de
- fr
- es
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: text-classification
---
# German, English, French and Spanish Language Detector

The GEFS-language-detector model achieves an F1 score close to 100% for every supported language in testing, which underscores its accuracy and reliability in identifying languages.
The model is fine-tuned from the base model [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on papluca's [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.


## Predicted output:

The model returns the detected language as one of the following language codes:
```
  - de as German
  - en as English
  - fr as French
  - es as Spanish
```
  
## Supported languages
Currently this model supports 4 languages; more languages will be added in the future.

The following languages are supported by the model:
- German (de)
- English (en)
- French (fr)
- Spanish (es)

## Use a pipeline as a high-level helper

```python
from transformers import pipeline

# One sample sentence per supported language (de, en, es, fr)
texts = [
    "Mir gefällt die Art und Weise, Sprachen zu erkennen",
    "I like the way to detect languages",
    "Me gusta la forma de detectar idiomas",
    "J'aime la façon de détecter les langues",
]

pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")

# top_k=1 keeps only the highest-scoring language for each input text
lang_detect = pipe(texts, top_k=1)
print("The detected languages are", lang_detect)
```
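
The pipeline returns predictions as dictionaries with `label` and `score` keys. The following is a minimal sketch of turning such output into readable language names. It assumes the labels are the language codes listed under Predicted output; the helper `describe`, the `LANGUAGE_NAMES` mapping, and the sample scores are illustrative and were not produced by the model.

```python
# Illustrative mapping from the model's language codes to display names.
LANGUAGE_NAMES = {"de": "German", "en": "English", "fr": "French", "es": "Spanish"}


def describe(prediction):
    """Return (language name, score) for one pipeline prediction.

    Accepts either a single {"label", "score"} dict or a list of such
    dicts (the shape returned per input when top_k is passed).
    """
    if isinstance(prediction, list):
        # Pick the highest-scoring candidate from the list.
        prediction = max(prediction, key=lambda p: p["score"])
    code = prediction["label"]
    # Fall back to the raw label if it is not one of the four known codes.
    return LANGUAGE_NAMES.get(code, code), prediction["score"]


# Hypothetical output shaped like a pipeline result; the scores are made up.
sample_output = [
    [{"label": "de", "score": 0.99}],
    [{"label": "en", "score": 0.98}],
]

for prediction in sample_output:
    name, score = describe(prediction)
    print(f"{name}: {score:.2f}")
```

Accepting both a dict and a list keeps the helper usable whether or not `top_k` is passed to the pipeline.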

## Load model directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")
```
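
Loading the model directly leaves tokenization and decoding of the logits to the caller. The sketch below shows one way this could look with PyTorch. The helper names `softmax`, `decode`, and `detect` are hypothetical, and the code assumes that `model.config.id2label` maps class indices to the language codes listed above.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return (label, probability) of the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def detect(texts, model_name="ImranzamanML/GEFS-language-detector"):
    """Download the model and classify a list of texts (needs torch and transformers)."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (len(texts), number of labels)

    # Assumption: config.id2label maps class indices to the language codes.
    return [decode(row, model.config.id2label) for row in logits.tolist()]
```

Calling `detect(["I like the way to detect languages"])` downloads the model on first use and returns one `(label, probability)` pair per input text.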

## Model Training
  
    Epoch   Training Loss   Validation Loss
    1       0.002600        0.000148
    2       0.001000        0.000015
    3       0.000000        0.000011
    4       0.001800        0.000009
    5       0.002700        0.000016
    6       0.001600        0.000012
    7       0.001300        0.000009
    8       0.001200        0.000008
    9       0.000900        0.000007
    10      0.000900        0.000007


## Testing Results
```
    Language   Precision   Recall   F1       Accuracy
    de         0.9997      0.9998   0.9998   0.9999
    en         1.0000      1.0000   1.0000   1.0000
    fr         0.9995      0.9996   0.9996   0.9996
    es         0.9994      0.9996   0.9995   0.9996
```
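
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, and the per-language scores can be averaged into a single macro F1. The sketch below uses the values copied from the table; the names `RESULTS`, `f1_score`, and `macro_average` are illustrative. Because the table values are rounded to four decimals, a recomputed F1 may differ from the reported one in the last digit.

```python
# Per-language test scores copied from the Testing Results table.
RESULTS = {
    "de": {"precision": 0.9997, "recall": 0.9998, "f1": 0.9998},
    "en": {"precision": 1.0000, "recall": 1.0000, "f1": 1.0000},
    "fr": {"precision": 0.9995, "recall": 0.9996, "f1": 0.9996},
    "es": {"precision": 0.9994, "recall": 0.9996, "f1": 0.9995},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(results, metric):
    """Unweighted mean of one metric across all languages."""
    return sum(row[metric] for row in results.values()) / len(results)


for lang, row in RESULTS.items():
    recomputed = f1_score(row["precision"], row["recall"])
    print(f"{lang}: reported F1 {row['f1']:.4f}, recomputed {recomputed:.4f}")

print(f"macro F1: {macro_average(RESULTS, 'f1'):.4f}")
```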



## About Author

  **Name**: Muhammad Imran Zaman 
  **Company**: [Theum AG](https://theum.com/en/index.htm?t=) 
  **Role**: Lead Machine Learning Engineer 

  **Professional Links**:
  - Kaggle: [Profile](https://www.kaggle.com/muhammadimran112233)
  - LinkedIn: [Profile](https://linkedin.com/in/muhammad-imran-zaman)
  - Google Scholar: [Profile](https://scholar.google.com/citations?user=ulVFpy8AAAAJ&hl=en)
  - YouTube: [Channel](https://www.youtube.com/@consolioo)
  - GitHub: [Channel](https://github.com/Imran-ml)