---
license: mit
language:
- sk
datasets:
- oscar-corpus/OSCAR-2109
pipeline_tag: fill-mask
library_name: transformers
---

# Slovak Morphological Baby Language Model (SK_Morph_BLM)

**SK_Morph_BLM** is a pretrained small language model for the Slovak language, based on the RoBERTa architecture. The model uses a custom morphological tokenizer designed specifically for Slovak, which focuses on **preserving the integrity of root morphemes**. Because of this tokenization approach, the tokenizer is not compatible with the standard `RobertaTokenizer` from the Hugging Face library. The model is case-insensitive and operates on lowercased text. While the pretrained model can be used for masked language modeling, it is primarily intended for fine-tuning on downstream NLP tasks.

## How to Use the Model

To use the SK_Morph_BLM model, follow these steps:

```python
import sys
import torch
from transformers import AutoModelForMaskedLM
from huggingface_hub import snapshot_download

# Download the repository from Hugging Face and append the path to sys.path
repo_path = snapshot_download(repo_id="daviddrzik/SK_Morph_BLM")
sys.path.append(repo_path)

# Import the custom tokenizer from the downloaded repository
from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer

# Initialize the tokenizer and model
tokenizer = SKMorfoTokenizer()
model = AutoModelForMaskedLM.from_pretrained("daviddrzik/SK_Morph_BLM")

# Function to fill in the masked token in a given text
def fill_mask(tokenized_text, tokenizer, model, top_k=5):
    inputs = tokenizer.tokenize(tokenized_text.lower(), max_length=256, return_tensors='pt', return_subword=False)
    # Token id 4 is the <mask> token in the custom vocabulary
    mask_token_index = torch.where(inputs["input_ids"][0] == 4)[0]
    with torch.no_grad():
        predictions = model(**inputs)

    topk_tokens = torch.topk(predictions.logits[0, mask_token_index], k=top_k, dim=-1).indices

    fill_results = []
    for idx, i in enumerate(mask_token_index):
        for token_idx in topk_tokens[idx]:
            token_text = tokenizer.convert_ids_to_tokens(token_idx.item())
            token_text = token_text.replace("Ġ", " ")  # The "Ġ" prefix marks a word boundary; render it as a space
            probability = torch.softmax(predictions.logits[0, i], dim=-1)[token_idx].item()
            fill_results.append({
                'score': probability,
                'token': token_idx.item(),
                'token_str': token_text,
                'sequence': tokenized_text.replace("<mask>", token_text.strip())
            })

    fill_results.sort(key=lambda x: x['score'], reverse=True)
    return fill_results

# Example usage of the function
# ("Last night we <mask> a new film at the cinema, which premiered just a week ago.")
text = "Včera večer sme <mask> nový film v kine, ktorý mal premiéru iba pred týždňom."
result = fill_mask(text.lower(), tokenizer, model, top_k=5)
print(result)

# Output:
[{'score': 0.4014046788215637,
  'token': 6626,
  'token_str': ' videli',
  'sequence': 'včera večer sme videli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.15018892288208008,
  'token': 874,
  'token_str': ' mali',
  'sequence': 'včera večer sme mali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.057530131191015244,
  'token': 21193,
  'token_str': ' pozreli',
  'sequence': 'včera večer sme pozreli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.049020398408174515,
  'token': 26468,
  'token_str': ' sledovali',
  'sequence': 'včera večer sme sledovali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.04107135161757469,
  'token': 9171,
  'token_str': ' objavili',
  'sequence': 'včera večer sme objavili nový film v kine, ktorý mal premiéru iba pred týždňom.'}]
```

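The snippet relies on two conventions of the custom tokenizer, both visible in the code above: the `<mask>` token maps to id 4 in the vocabulary, and the `Ġ` prefix marks tokens that begin a new word, which is why it is replaced with a space before the predicted token is substituted into the output sequence.
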
## Training Data

The `SK_Morph_BLM` model was pretrained on a Slovak-language subset of the OSCAR 2019 corpus. The corpus underwent comprehensive preprocessing to ensure the quality and relevance of the data:

- **Language Filtering:** Non-Slovak text was removed to focus solely on the Slovak language.
- **Character Normalization:** Various types of spaces, quotes, dashes, and separators were standardized (e.g., different kinds of spaces were replaced with a single space, and dashes with hyphens). Emoticons were replaced with spaces.
- **Symbol and Unwanted Text Removal:** Sentences containing mathematical symbols, pictograms, or characters from Asian and African scripts were deleted. Duplicated punctuation, special characters, and spaces were also removed.
- **URL and Text Normalization:** All web addresses were removed, and the text was converted to lowercase to simplify tokenization.
- **Content Cleanup:** Irrelevant content from web crawling, such as keywords and HTML tags, was identified and removed.

Additionally, the preprocessing included further refinement steps to create the final dataset (an illustrative sketch of these operations follows the list):

- **Parentheses Content Removal:** All content within parentheses was removed to reduce noise.
- **Selection of Text Segments:** Medium-length paragraphs were selected to maintain consistency.
- **Similarity Filtering:** Paragraphs with at least 50% similarity to earlier ones were removed to minimize redundancy.
- **Random Sampling:** Finally, 20% of the remaining paragraphs were randomly selected.

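The original preprocessing scripts are not included in this repository; the minimal sketch below only illustrates how steps like these could be implemented. The function names, the regular expressions, and the use of `difflib` for the 50% similarity check are all assumptions for illustration, not the actual pipeline:

```python
import difflib
import random
import re

def clean_paragraph(text: str) -> str:
    """Illustrative normalization: lowercase, strip URLs and parenthesized content."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove web addresses
    text = re.sub(r"\([^)]*\)", " ", text)              # drop content in parentheses
    return re.sub(r"\s+", " ", text).strip()            # collapse duplicated spaces

def is_redundant(paragraph: str, kept: list, threshold: float = 0.5) -> bool:
    """Assumed similarity check: >= 50% match against a previously kept paragraph."""
    return any(
        difflib.SequenceMatcher(None, paragraph, prev).ratio() >= threshold
        for prev in kept
    )

def build_corpus(paragraphs, sample_rate: float = 0.2):
    kept = []
    for p in map(clean_paragraph, paragraphs):
        if p and not is_redundant(p, kept):
            kept.append(p)
    # Randomly sample 20% of the remaining paragraphs
    return random.sample(kept, int(len(kept) * sample_rate))
```

Note that comparing each paragraph against all previously kept ones is quadratic; at the scale of this corpus the actual pipeline would likely use a cheaper similarity heuristic.
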
After preprocessing, the training corpus consisted of:

- **455 MB of text**
- **895,125 paragraphs**
- **64.6 million words**
- **1.13 million unique words**
- **119 unique characters**

## Pretraining

The `SK_Morph_BLM` model was trained with the following key parameters:

- **Architecture:** Based on RoBERTa, with 6 hidden layers and 12 attention heads.
- **Hidden size:** 576
- **Vocabulary size:** 50,264 tokens
- **Sequence length:** 256 tokens
- **Dropout:** 0.1
- **Number of parameters:** 58 million
- **Optimizer:** AdamW, learning rate 1×10⁻⁴, weight decay 0.01
- **Training:** 30 epochs, divided into 3 phases:
  - **Phase 1:** 10 epochs on CPU (4x AMD EPYC 7542), batch size 64, 50 hours per epoch, 139,870 steps in total.
  - **Phase 2:** 5 epochs on GPU (1x Nvidia A100 40GB), batch size 64, 100 minutes per epoch, 69,935 steps in total.
  - **Phase 3:** 15 epochs on GPU (2x Nvidia A100 40GB), batch size 128, 60 minutes per epoch, 104,910 steps in total.

The model was trained with the Hugging Face library, but using a native PyTorch training loop rather than the `Trainer` class. A configuration sketch matching the listed hyperparameters is shown below.

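For reference, a minimal sketch of how a model with this configuration could be instantiated via the standard Hugging Face API. The `intermediate_size` (4 × hidden size) and `max_position_embeddings` (sequence length plus RoBERTa's usual offset of 2) are assumptions, since they are not stated above:

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration mirroring the hyperparameters listed above;
# intermediate_size and max_position_embeddings are assumed values.
config = RobertaConfig(
    vocab_size=50264,
    hidden_size=576,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=4 * 576,        # assumed 4x hidden, as in standard RoBERTa
    max_position_embeddings=256 + 2,  # sequence length plus RoBERTa position offset
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

model = RobertaForMaskedLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Should print a count close to the 58M parameters reported above
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
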
## Fine-Tuned Versions of the SK_Morph_BLM Model

The following fine-tuned versions of the `SK_Morph_BLM` model are available:

- [`SK_Morph_BLM-ner`](https://huggingface.co/daviddrzik/SK_Morph_BLM-ner): Fine-tuned for Named Entity Recognition (NER) tasks.
- [`SK_Morph_BLM-pos`](https://huggingface.co/daviddrzik/SK_Morph_BLM-pos): Fine-tuned for Part-of-Speech (POS) tagging.
- [`SK_Morph_BLM-qa`](https://huggingface.co/daviddrzik/SK_Morph_BLM-qa): Fine-tuned for Question Answering tasks.
- [`SK_Morph_BLM-sentiment-csfd`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-csfd): Fine-tuned for sentiment analysis on the CSFD (movie review) dataset.
- [`SK_Morph_BLM-sentiment-multidomain`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-multidomain): Fine-tuned for sentiment analysis across multiple domains.
- [`SK_Morph_BLM-sentiment-reviews`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-reviews): Fine-tuned for sentiment analysis on general review datasets.
- [`SK_Morph_BLM-topic-news`](https://huggingface.co/daviddrzik/SK_Morph_BLM-topic-news): Fine-tuned for topic classification in news articles.
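
Loading a fine-tuned checkpoint should follow the same pattern as the base model. The sketch below is an assumption, not a documented recipe: it presumes the sentiment checkpoint exposes a standard sequence-classification head and reuses the `tokenizer` instance from the masked-LM example above.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Assumption: the checkpoint ships a standard sequence-classification head.
clf = AutoModelForSequenceClassification.from_pretrained(
    "daviddrzik/SK_Morph_BLM-sentiment-csfd"
)

# Reuse the SKMorfoTokenizer instance initialized earlier.
# ("that film was excellent")
inputs = tokenizer.tokenize("ten film bol výborný".lower(), max_length=256,
                            return_tensors='pt', return_subword=False)
with torch.no_grad():
    probs = clf(**inputs).logits.softmax(dim=-1)
print(probs)
```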