NiGuLa committed on
Commit
7d09a80
2 Parent(s): 5149f91 1a7bf8e

Merge branch 'main' of https://huggingface.co/Skoltech/russian-inappropriate-messages into main

Files changed (1)
  1. README.md +19 -2
README.md CHANGED
@@ -11,9 +11,20 @@ licenses:
11
 
12
  ## General concept of the model
13
 
14
- This model is trained on the dataset of inappropriate messages of the Russian language. The concept of inappropriateness is described [in this article ](https://www.aclweb.org/anthology/2021.bsnlp-1.4/) presented at the workshop for Balto-Slavic NLP at the EACL-2021 conference. Please note that this article describes the first version of the dataset, while the model is trained on the extended version of the dataset open-sourced on our [GitHub](https://github.com/skoltech-nlp/inappropriate-sensitive-topics/blob/main/Version2/appropriateness/Appropriateness.csv) or on [kaggle](https://www.kaggle.com/nigula/russianinappropriatemessages). The properties of the dataset is the same as the one described in the article, the only difference is the size.
15
 
16
- The model was trained, validated and tested only on the samples with 100% confidence, which allowed to get the following metrics on test set:
17
 
18
  | | precision | recall | f1-score | support |
19
  |--------------|----------|--------|----------|---------|
@@ -23,6 +34,12 @@ The model was trained, validated and tested only on the samples with 100% confid
23
  | macro avg | 0.86 | 0.85 | 0.85 | 10565 |
24
  | weighted avg | 0.89 | 0.89 | 0.89 | 10565 |
25
 
26
  ## Licensing Information
27
 
28
  [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].
 
11
 
12
  ## General concept of the model
13
 
14
+ #### Inappropriateness intuition
15
+ This model is trained on a dataset of inappropriate messages in Russian. Generally, an inappropriate utterance is one that contains neither obscene words nor any kind of toxic intent, but can still harm the reputation of the speaker. The table below gives some samples for intuition. Learn more about the concept of inappropriateness [in this article](https://www.aclweb.org/anthology/2021.bsnlp-1.4/) presented at the workshop for Balto-Slavic NLP at the EACL-2021 conference. Please note that this article describes the first version of the dataset, while the model is trained on the extended version of the dataset, open-sourced on our [GitHub](https://github.com/skoltech-nlp/inappropriate-sensitive-topics/blob/main/Version2/appropriateness/Appropriateness.csv) and on [Kaggle](https://www.kaggle.com/nigula/russianinappropriatemessages). The properties of the dataset are the same as those described in the article; the only difference is the size.
16
 
17
+ | Text | Inappropriate | Comment | Unsafe topic |
18
+ |------------------------------------------------------------------------------------------------------|---------------|----------------------------------------------------------------|---------------|
19
+ | Ладно бы видного деятеля завалили а тут какого то ноунейм нигру преступника | yes | Justification of murder | offline crime |
20
+ | Это нарушение УКРФ! | no | Condemnation of breaking the law | offline crime |
21
+ | Религия - это насилие высшего уровня. Верующего принуждают к вере еще в младенчестве. | yes | Insulting the feelings of believers | religion |
22
+ | Христиане, как вы относитесь к иноверцам, в частности к Гамалиилу и к уверовавшим в Иисуса фарисеям? | no | Ordinary questions to believers with no hint of insulting their feelings | religion |
23
+ | Хорошо что проституция запрещена | no | Condemnation of prostitution | prostitution |
24
+ | Возьму бабки и сниму шлюх на выходных | yes | Promotion of prostitution | prostitution |
25
+
26
+ #### Metrics
27
+ The model was trained, validated, and tested only on the samples with 100% confidence, which yielded the following metrics on the test set:
28
 
29
  | | precision | recall | f1-score | support |
30
  |--------------|----------|--------|----------|---------|
 
34
  | macro avg | 0.86 | 0.85 | 0.85 | 10565 |
35
  | weighted avg | 0.89 | 0.89 | 0.89 | 10565 |
36
 
37
+ #### Proposed usage
38
+
39
+ The notion of 'inappropriateness' that we tried to capture in the dataset and detect with the model is not a substitute for toxicity; it is rather a derivative of toxicity. A model trained on our dataset can therefore serve as an additional layer of inappropriateness filtering, applied after toxicity and obscenity filtering.
40
+
41
+ ![Alt text](.classifier_all_en.pdf?raw=true)
42
+
43
  ## Licensing Information
44
 
45
  [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].
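
The "Proposed usage" text added in this commit describes the model as an extra filtering layer applied after toxicity and obscenity filtering. A minimal Python sketch of that layering might look like the following. Everything in it is an assumption made for illustration: the names `FilterStage` and `first_blocking_stage`, the 0.5 threshold, the relative order of the obscenity and toxicity stages, and the keyword-based toy scorers, which stand in for real classifiers and are not the Skoltech model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A scorer maps a message to a score in [0, 1]; higher means "more likely to block".
Scorer = Callable[[str], float]


@dataclass
class FilterStage:
    """One layer of the filtering cascade (hypothetical helper, not from the README)."""
    name: str
    scorer: Scorer
    threshold: float = 0.5  # assumed cut-off; tune per classifier


def first_blocking_stage(text: str, stages: Sequence[FilterStage]) -> Optional[str]:
    """Return the name of the first stage that rejects `text`, or None if all stages pass."""
    for stage in stages:
        if stage.scorer(text) >= stage.threshold:
            return stage.name
    return None


# Toy stand-ins for real classifiers, used only so the sketch runs offline.
def toy_obscenity(text: str) -> float:
    return 1.0 if "damn" in text.lower() else 0.0


def toy_toxicity(text: str) -> float:
    return 1.0 if "idiot" in text.lower() else 0.0


def toy_inappropriateness(text: str) -> float:
    return 0.9 if "heist" in text.lower() else 0.1


# Inappropriateness runs last, after the obscenity and toxicity layers.
STAGES = [
    FilterStage("obscenity", toy_obscenity),
    FilterStage("toxicity", toy_toxicity),
    FilterStage("inappropriateness", toy_inappropriateness),
]

if __name__ == "__main__":
    for message in ["Nice weather today", "You idiot", "Planning a heist this weekend"]:
        print(repr(message), "->", first_blocking_stage(message, STAGES))
```

In a real deployment, the scorer of the last stage would wrap the inappropriateness classifier from this repository, and the toy keyword checks would be replaced by actual obscenity and toxicity models. Returning the name of the blocking stage, rather than a bare boolean, keeps it visible which layer rejected a message.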