#### 1D-CNN-MC-toxicity-classifier-ru
(One-Dimensional Convolutional Neural Network with Multi-Channel input)

Architectural visualization:

![](https://i.imgur.com/skbLM6w.png)

Total parameters: 503,249

##### Test Accuracy: 94.44%
##### Training Accuracy: 97.46%

This model was developed for binary toxicity classification of Cyrillic (Russian) text.
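
The exact architecture is stored in model_architecture.json and shown in the visualization above. For orientation only, here is a minimal Keras sketch of a multi-channel 1D-CNN text classifier (parallel Conv1D branches with different kernel sizes over a shared character embedding); the layer sizes below are assumptions and do not reproduce the 503,249-parameter model.

    from tensorflow import keras
    from tensorflow.keras import layers

    max_len = 400     # recommended maximum input length
    vocab_size = 100  # assumption: size of the character vocabulary

    # Shared character embedding feeding several parallel Conv1D "channels"
    inputs = keras.Input(shape=(max_len,))
    embedding = layers.Embedding(vocab_size, 64)(inputs)

    branches = []
    for kernel_size in (3, 4, 5):  # assumed kernel sizes
        conv = layers.Conv1D(64, kernel_size, activation='relu')(embedding)
        branches.append(layers.GlobalMaxPooling1D()(conv))

    merged = layers.Concatenate()(branches)
    merged = layers.Dropout(0.5)(merged)
    outputs = layers.Dense(1, activation='sigmoid')(merged)  # binary: toxic / normal

    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])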

##### A dataset of 75,093 negative and 75,093 positive rows was used for training.

##### Recommended length of the input sequence: 25 - 400 Cyrillic characters.

##### Simplifications applied to the dataset strings:
- Removing extra spaces.
- Converting uppercase letters to lowercase (Я -> я).
- Removing all non-Cyrillic characters, i.e. Latin letters, digits and punctuation (z, !, ., #, 4, &, etc.).
- Replacing ё with е.
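
At inference time the input text should ideally be normalized in the same way. The exact preprocessing code is not published, so the following is only an approximate sketch of such a cleaning function:

    import re

    def normalize_text(text: str) -> str:
        """Approximate the dataset simplifications described above."""
        text = text.lower()                       # Я -> я
        text = text.replace('ё', 'е')             # ё -> е
        text = re.sub(r'[^а-я ]', ' ', text)      # drop all non-Cyrillic characters
        text = re.sub(r'\s+', ' ', text).strip()  # collapse extra spaces
        return text

    print(normalize_text("Да какой Идиот сделал эту НС?!"))  # -> "да какой идиот сделал эту нс"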

##### Example of use:

    import os
    from tensorflow import keras
    from tensorflow.keras.preprocessing.text import tokenizer_from_json
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from safetensors.numpy import load_file

    # Folder where the model files are stored
    model_dir = 'model'
    max_len = 400

    # Load the model architecture
    with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
        model_json = json_file.read()
    model = keras.models.model_from_json(model_json)

    # Load the weights from safetensors
    state_dict = load_file(os.path.join(model_dir, 'model_weights.safetensors'))
    weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
    model.set_weights(weights)

    # Load the tokenizer
    with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
        tokenizer_json = f.read()
    tokenizer = tokenizer_from_json(tokenizer_json)

    def predict_toxicity(text):
        # Convert the text to a padded sequence of token indices
        sequences = tokenizer.texts_to_sequences([text])
        padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
        probability = model.predict(padded)[0][0]
        class_label = "toxic" if probability >= 0.5 else "normal"
        return class_label, probability

    # Example usage
    text = "Да какой идиот сделал эту НС?"
    class_label, probability = predict_toxicity(text)
    print(f"Text: {text}")
    print(f"Class: {class_label} ({probability:.2%})")

###### Output: 
Text: Да какой идиот сделал эту НС?
Class: toxic (99.35%)
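
Since the training data was normalized as described above, applying the same normalization before prediction may give more consistent results. The example above passes the raw text directly, so this is only a suggested variation using the hypothetical normalize_text sketch from earlier:

    raw_text = "Да какой идиот сделал эту НС?"
    class_label, probability = predict_toxicity(normalize_text(raw_text))
    print(f"Class: {class_label} ({probability:.2%})")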