---
license: cc-by-nc-4.0
language:
- az
pipeline_tag: text-classification
---
# Multilingual Language Detection

## Model Description
This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. The model distinguishes between 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

## How to Use
You can use this model directly with a pipeline for text classification, or with the `transformers` library for more custom usage, as shown in the example below.

### Quick Start
First, install the required libraries if you haven't already:
```bash
pip install transformers torch
```

```python
from transformers import AutoModelForSequenceClassification, XLMRobertaTokenizer
import torch

# Load tokenizer and model
tokenizer = XLMRobertaTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Prediction
model.eval()
with torch.no_grad():
    outputs = model(**encoded_input)

# Process the outputs
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax().item()
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it",
          "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
```
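
The final decoding step is independent of the model itself: softmax turns the logits into probabilities, and argmax picks the most probable class. A self-contained sketch of that step, using a made-up logits vector (the values are illustrative, not real model output):

```python
import math

labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it",
          "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Dummy logits: index 0 ("az") is the largest, so it wins
label, prob = decode([5.0] + [0.1] * 20)
print(label)  # az
```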

## Language Label Information

The model outputs a label for each prediction, corresponding to one of the languages listed below. Each label is associated with a specific language code as detailed in the following table:

| Label | Language Code | Language Name |
|-------|---------------|---------------|
| LABEL_0 | az | Azerbaijani |
| LABEL_1 | ar | Arabic |
| LABEL_2 | bg | Bulgarian |
| LABEL_3 | de | German |
| LABEL_4 | el | Greek |
| LABEL_5 | en | English |
| LABEL_6 | es | Spanish |
| LABEL_7 | fr | French |
| LABEL_8 | hi | Hindi |
| LABEL_9 | it | Italian |
| LABEL_10 | ja | Japanese |
| LABEL_11 | nl | Dutch |
| LABEL_12 | pl | Polish |
| LABEL_13 | pt | Portuguese |
| LABEL_14 | ru | Russian |
| LABEL_15 | sw | Swahili |
| LABEL_16 | th | Thai |
| LABEL_17 | tr | Turkish |
| LABEL_18 | ur | Urdu |
| LABEL_19 | vi | Vietnamese |
| LABEL_20 | zh | Chinese |

This mapping is used to decode the model's predictions into human-readable language names for further processing or analysis.
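
The table above can be carried in code as a pair of dictionaries, so a predicted class index decodes directly to a language code and name. A minimal sketch (the helper name `label_name` is illustrative, not part of the model's API):

```python
# Class index -> ISO 639-1 language code, in the model's label order
id2label = {
    0: "az", 1: "ar", 2: "bg", 3: "de", 4: "el", 5: "en", 6: "es",
    7: "fr", 8: "hi", 9: "it", 10: "ja", 11: "nl", 12: "pl", 13: "pt",
    14: "ru", 15: "sw", 16: "th", 17: "tr", 18: "ur", 19: "vi", 20: "zh",
}

# Language code -> full language name
code2name = {
    "az": "Azerbaijani", "ar": "Arabic", "bg": "Bulgarian", "de": "German",
    "el": "Greek", "en": "English", "es": "Spanish", "fr": "French",
    "hi": "Hindi", "it": "Italian", "ja": "Japanese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sw": "Swahili",
    "th": "Thai", "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese",
    "zh": "Chinese",
}

def label_name(index):
    """Decode a predicted class index to a full language name."""
    return code2name[id2label[index]]

print(label_name(0))  # Azerbaijani
```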

## Training Performance

The model was trained over three epochs, with validation loss, accuracy, and F1 score improving each epoch:

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.0127 | 0.0174 | 0.9966 | 0.9966 |
| 2 | 0.0149 | 0.0141 | 0.9973 | 0.9973 |
| 3 | 0.0001 | 0.0109 | 0.9984 | 0.9984 |

## Test Results

The model achieved the following results on the test set:

| Metric | Value |
|--------|-------|
| Loss | 0.0133 |
| Accuracy | 0.9975 |
| F1 Score | 0.9975 |
| Precision | 0.9975 |
| Recall | 0.9975 |
| Evaluation Time | 17.5 s |
| Samples per Second | 599.685 |
| Steps per Second | 9.424 |
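
For reference, metrics of the kind reported above are computed from predicted versus true labels; in practice a library such as scikit-learn is typically used, but the definitions are simple enough to sketch in pure Python:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    # Per-class F1 from precision and recall, averaged over classes
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: 3 of 4 predictions correct
y_true = ["az", "en", "en", "tr"]
y_pred = ["az", "en", "tr", "tr"]
print(accuracy(y_true, y_pred))  # 0.75
```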

## License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows you to freely share, redistribute, and adapt the material with attribution to the source, but prohibits commercial use.

## Contact Information

If you have any questions or suggestions, please contact us at [v.resad.89@gmail.com].