daviddrzik committed "Update README.md"

README.md CHANGED
@@ -1,3 +1,114 @@
---
license: mit
language:
- sk
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_Morph_BLM
tags:
- topic
---

# Fine-Tuned Topic Classification Model - SK_Morph_BLM (Topic News)

## Model Overview
This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for topic classification. For this task, we used the Slovak Categorized News Corpus, which contains news articles divided into six categories: Economy and Business, Culture, News, World, Sports, and Healthcare. The corpus provides text files with detailed annotations, including token and sentence boundary identification, stop words, morphological analysis, named entity recognition, and lemmatization.

## Topic Labels
Each record in the dataset is labeled with one of the following topics:
- **Healthcare (0):** 2,564 records
- **News (1):** 4,174 records
- **Sports (2):** 2,759 records
- **World (3):** 1,660 records
- **Economy and Business (4):** 4,199 records
- **Culture (5):** 137 records
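
The class distribution above is uneven, with Culture far smaller than the other topics. A minimal sketch for inspecting each class share follows; `LABELS` and `class_shares` are hypothetical helpers built from the list above, not something shipped with the model, and the names stored in the model's own `id2label` may be spelled differently.

```python
# Hypothetical helper: label ids, names, and record counts copied from the list above.
LABELS = {
    0: ("Healthcare", 2564),
    1: ("News", 4174),
    2: ("Sports", 2759),
    3: ("World", 1660),
    4: ("Economy and Business", 4199),
    5: ("Culture", 137),
}


def class_shares(labels=LABELS):
    """Return {topic name: fraction of all records}."""
    total = sum(count for _, count in labels.values())
    return {name: count / total for name, count in labels.values()}


shares = class_shares()

# Print the topics from most to least frequent.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22}{share:6.1%}")
```

The counts sum to the 15,493 records reported for the final dataset.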

## Dataset Details
The original corpus did not contain continuous text, requiring significant preprocessing. The process involved:
1. **Reconstruction:** We reconstructed coherent text from individual annotated files, resulting in over 86,000 sentences.
2. **Combining Sentences:** Sentences from each file were combined into single records, with a maximum length of 600 characters (approximately 200 tokens).

The final dataset comprises a total of 15,493 records, each labeled according to the categories listed above.

For more information about the dataset, please visit [the corpus page](https://nlp.kemt.fei.tuke.sk/language/categorizednews).
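
The sentence-combining step can be sketched as a greedy packing loop. This is only an illustration under stated assumptions: the exact procedure used to build the dataset is not described here, `combine_sentences` is a hypothetical function, and joining sentences with a single space is a guess.

```python
def combine_sentences(sentences, max_chars=600):
    """Greedily pack consecutive sentences into records of at most max_chars characters.

    Assumption: sentences are joined with one space. A single sentence longer
    than max_chars is kept whole as its own record rather than being split.
    """
    records = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if not current or len(candidate) <= max_chars:
            # Either the record is empty or the sentence still fits.
            current = candidate
        else:
            # The sentence does not fit: close the record and start a new one.
            records.append(current)
            current = sentence
    if current:
        records.append(current)
    return records


# Toy input standing in for the sentences reconstructed from one file.
toy_sentences = ["a" * 300, "b" * 250, "c" * 100]
toy_records = combine_sentences(toy_sentences)
print([len(record) for record in toy_records])
```

Packing per file, as described above, would mean calling the function once for each file's sentence list so that records never mix articles.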

## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 1e-05
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 10
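
As a rough sketch of what these settings imply, the snippet below collects them in a plain dictionary and estimates the number of optimizer steps. It assumes, which is not stated above, that each cross-validation run trains on roughly nine tenths of the 15,493 records with no gradient accumulation. The dictionary keys and `estimate_steps` are illustrative names, not part of any library API.

```python
import math

# Values copied from the list above; the key names are illustrative.
HYPERPARAMETERS = {
    "learning_rate": 1e-5,
    "train_batch_size": 64,
    "eval_batch_size": 64,
    "seed": 42,
    "num_epochs": 10,
}

TOTAL_RECORDS = 15_493  # dataset size reported in this README
N_FOLDS = 10            # assumption: training uses 9 of the 10 folds


def estimate_steps(total_records, n_folds, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for one cross-validation run."""
    # Training split: all records minus one held-out fold.
    train_records = total_records - total_records // n_folds
    steps_per_epoch = math.ceil(train_records / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = estimate_steps(
    TOTAL_RECORDS,
    N_FOLDS,
    HYPERPARAMETERS["train_batch_size"],
    HYPERPARAMETERS["num_epochs"],
)
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```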

## Model Performance

The model was evaluated using stratified 10-fold cross-validation, achieving a median weighted F1-score of <span style="font-size: 24px;">**0.968**</span> across the folds.
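
To make the metric concrete, here is a minimal pure-Python sketch of a support-weighted F1-score, a simple stratified fold assignment, and the median across folds. It is not the evaluation code behind the number above: `weighted_f1` and `stratified_folds` are hypothetical helpers, the fold assignment is deterministic (no shuffling), and the toy labels and predictions are made up purely for illustration.

```python
from collections import Counter
from statistics import median


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


def stratified_folds(labels, k=10):
    """Deal the indices of each class round-robin into k folds."""
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_label.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Toy data: 20 records of class 0, 10 of class 1, and a baseline that
# always predicts class 0. A real run would train a model on the other
# nine folds before scoring each held-out fold.
labels = [0] * 20 + [1] * 10
predictions = [0] * len(labels)

fold_scores = [
    weighted_f1([labels[i] for i in fold], [predictions[i] for i in fold])
    for fold in stratified_folds(labels, k=10)
]
print(f"Median weighted F1 across folds: {median(fold_scores):.3f}")
```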

## Model Usage

This model is suitable for topic classification in Slovak text, particularly for news articles across various categories. It is specifically designed for applications requiring topic categorization of news content and may not generalize well to other types of text.

### Example Usage

Below is an example of how to use the fine-tuned `SK_Morph_BLM-topic-news` model in a Python script:

```python
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import RobertaForSequenceClassification


class TopicClassifier:
    def __init__(self, tokenizer, model):
        # Load the fine-tuned classification model (six topic labels)
        self.model = RobertaForSequenceClassification.from_pretrained(model, num_labels=6)

        # Download the tokenizer repository and make it importable
        repo_path = snapshot_download(repo_id=tokenizer)
        sys.path.append(repo_path)

        # Import the custom tokenizer from the downloaded repository
        from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
        self.tokenizer = SKMorfoTokenizer()

    def tokenize_text(self, text):
        # Lowercase the input and encode it as PyTorch tensors
        encoded_text = self.tokenizer.tokenize(text.lower(), max_length=256, return_tensors='pt', return_subword=False)
        return encoded_text

    def classify_text(self, encoded_text):
        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            predicted_class = torch.argmax(logits, dim=1).item()
            # Convert logits to per-class probabilities
            probabilities = torch.softmax(logits, dim=1)
            class_probabilities = probabilities[0].tolist()
            predicted_class_text = self.model.config.id2label[predicted_class]
            return predicted_class, predicted_class_text, class_probabilities


# Instantiate the topic classifier with the specified tokenizer and model
classifier = TopicClassifier(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-topic-news")

# Example text to classify topic
text_to_classify = "Tento dôležitý zápas medzi Českou republikou a Švajčiarskom sa po troch tretinách skončil 2:0."
print("Text to classify: " + text_to_classify + "\n")

# Tokenize the input text
encoded_text = classifier.tokenize_text(text_to_classify)

# Classify the topic of the tokenized text
predicted_class, predicted_class_text, class_probabilities = classifier.classify_text(encoded_text)

# Print the predicted class label and index
print(f"Predicted class: {predicted_class_text} ({predicted_class})")
# Print the probabilities for each class
print(f"Class probabilities: {class_probabilities}")
```

### Example Output

Here is the output when running the above example:
```yaml
Text to classify: Tento dôležitý zápas medzi Českou republikou a Švajčiarskom sa po troch tretinách skončil 2:0.

Predicted class: Sport (2)
Class probabilities: [0.00015074180555529892, 0.000343791936757043, 0.9958429932594299, 0.0015455043176189065, 0.0013796273851767182, 0.000737304100766778]
```
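
The probability list is easier to read when each value is paired with a topic name. A minimal sketch follows, assuming the index order given in the Topic Labels section; `TOPIC_NAMES` and `rank_topics` are hypothetical helpers, and the names stored in the model's `id2label` may differ slightly (the output above prints "Sport").

```python
# Assumption: index order taken from the Topic Labels section of this README.
TOPIC_NAMES = ["Healthcare", "News", "Sports", "World", "Economy and Business", "Culture"]


def rank_topics(probabilities, names=TOPIC_NAMES):
    """Pair each probability with its topic name, most likely topic first."""
    if len(probabilities) != len(names):
        raise ValueError("expected one probability per topic")
    return sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)


# Probabilities copied from the example output above.
probabilities = [
    0.00015074180555529892,
    0.000343791936757043,
    0.9958429932594299,
    0.0015455043176189065,
    0.0013796273851767182,
    0.000737304100766778,
]

for name, probability in rank_topics(probabilities):
    print(f"{name:<22}{probability:.4f}")
```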