---
license: mit
language:
- sk
datasets:
- oscar-corpus/OSCAR-2109
pipeline_tag: fill-mask
library_name: transformers
---
# Slovak Morphological Baby Language Model (SK_Morph_BLM)

**SK_Morph_BLM** is a pretrained small language model for Slovak, based on the RoBERTa architecture. It uses a custom morphological tokenizer designed specifically for Slovak, which focuses on **preserving the integrity of root morphemes**. Because of this tokenization approach, the tokenizer is not compatible with the standard `RobertaTokenizer` from the Hugging Face library. The model is case-insensitive and operates on lowercase text. While the pretrained model can be used for masked language modeling, it is primarily intended for fine-tuning on downstream NLP tasks.

## How to Use the Model

To use the SK_Morph_BLM model, follow these steps:
```python
import torch
import sys
from transformers import AutoModelForMaskedLM
from huggingface_hub import snapshot_download

# Download the repository from Hugging Face and append its path to sys.path
repo_path = snapshot_download(repo_id="daviddrzik/SK_Morph_BLM")
sys.path.append(repo_path)

# Import the custom tokenizer from the downloaded repository
from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer

# Initialize the tokenizer and model
tokenizer = SKMorfoTokenizer()
model = AutoModelForMaskedLM.from_pretrained("daviddrzik/SK_Morph_BLM")

# Function to fill in the masked token in a given text
def fill_mask(tokenized_text, tokenizer, model, top_k=5):
    inputs = tokenizer.tokenize(tokenized_text.lower(), max_length=256, return_tensors='pt', return_subword=False)
    mask_token_index = torch.where(inputs["input_ids"][0] == 4)[0]  # ID 4 is the <mask> token
    with torch.no_grad():
        predictions = model(**inputs)

    topk_tokens = torch.topk(predictions.logits[0, mask_token_index], k=top_k, dim=-1).indices

    fill_results = []
    for idx, i in enumerate(mask_token_index):
        for token_idx in topk_tokens[idx]:
            token_text = tokenizer.convert_ids_to_tokens(token_idx.item())
            token_text = token_text.replace("Ġ", " ")  # Replace the BPE word-boundary marker with a space
            probability = torch.softmax(predictions.logits[0, i], dim=-1)[token_idx].item()
            fill_results.append({
                'score': probability,
                'token': token_idx.item(),
                'token_str': token_text,
                'sequence': tokenized_text.replace("<mask>", token_text.strip())
            })

    fill_results.sort(key=lambda x: x['score'], reverse=True)
    return fill_results

# Example usage of the function
# ("Yesterday evening we <mask> a new film at the cinema, which premiered only a week ago.")
text = "Včera večer sme <mask> nový film v kine, ktorý mal premiéru iba pred týždňom."
result = fill_mask(text.lower(), tokenizer, model, top_k=5)
print(result)
```

Example output:

```
[{'score': 0.4014046788215637,
  'token': 6626,
  'token_str': ' videli',
  'sequence': 'včera večer sme videli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.15018892288208008,
  'token': 874,
  'token_str': ' mali',
  'sequence': 'včera večer sme mali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.057530131191015244,
  'token': 21193,
  'token_str': ' pozreli',
  'sequence': 'včera večer sme pozreli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.049020398408174515,
  'token': 26468,
  'token_str': ' sledovali',
  'sequence': 'včera večer sme sledovali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.04107135161757469,
  'token': 9171,
  'token_str': ' objavili',
  'sequence': 'včera večer sme objavili nový film v kine, ktorý mal premiéru iba pred týždňom.'}]
```
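
Because the tokenizer's main feature is keeping root morphemes intact, it can be instructive to look at how it segments a sentence. The minimal sketch below reuses only the `tokenize` and `convert_ids_to_tokens` methods shown above, and assumes the tokenizer from the previous snippet is already loaded:

```python
# Inspect how the morphological tokenizer segments a sentence
sentence = "včera večer sme videli nový film v kine."
inputs = tokenizer.tokenize(sentence, max_length=256, return_tensors='pt', return_subword=False)
tokens = [tokenizer.convert_ids_to_tokens(i.item()) for i in inputs["input_ids"][0]]
print(tokens)  # token strings, with "Ġ" marking word boundaries
```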

## Training Data

The `SK_Morph_BLM` model was pretrained on the Slovak portion of the OSCAR corpus (the `oscar-corpus/OSCAR-2109` release). The corpus underwent comprehensive preprocessing to ensure the quality and relevance of the data (an illustrative sketch follows the list):
- **Language Filtering:** Non-Slovak text was removed to focus solely on the Slovak language.
- **Character Normalization:** Various types of spaces, quotes, dashes, and separators were standardized (e.g., replacing different types of spaces with a single space, or dashes with hyphens). Emoticons were replaced with spaces.
- **Symbol and Unwanted Text Removal:** Sentences containing mathematical symbols, pictograms, or characters from Asian and African languages were deleted. Duplicates of punctuation, special characters, and spaces were also removed.
- **URL and Text Normalization:** All web addresses were removed, and the text was converted to lowercase to simplify tokenization.
- **Content Cleanup:** Text that included irrelevant content from web crawling, such as keywords and HTML tags, was identified and removed.
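
As a rough illustration of these normalization steps, a function like the one below could be used. The exact rules, character classes, and regular expressions used by the authors are not published, so everything here is an assumption:

```python
import re

def normalize_paragraph(text: str) -> str:
    """Illustrative normalization only; not the authors' actual preprocessing script."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove web addresses
    text = re.sub(r"[\u00A0\u2000-\u200B\u202F]", " ", text)    # unify exotic space characters
    text = text.replace("\u201E", '"').replace("\u201C", '"')   # standardize quotes
    text = re.sub(r"[\u2012-\u2015]", "-", text)                # replace dashes with hyphens
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)  # drop emoticons/pictograms
    text = re.sub(r"([.,!?;:])\1+", r"\1", text)                # collapse duplicated punctuation
    text = re.sub(r"\s{2,}", " ", text).strip()                 # collapse duplicated spaces
    return text.lower()                                         # lowercase for tokenization
```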

Additionally, the preprocessing included further refinement steps to create the final dataset (sketched after the list):

- **Parentheses Content Removal:** All content within parentheses was removed to reduce noise.
- **Selection of Text Segments:** Medium-length text paragraphs were selected to maintain consistency.
- **Similarity Filtering:** Paragraphs with at least 50% similarity to previous ones were removed to minimize redundancy.
- **Random Sampling:** Finally, 20% of the remaining paragraphs were randomly selected.
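
These refinement steps could look roughly like the sketch below. The length bounds, similarity metric, comparison window, and sampling seed are all assumptions, since the README does not specify them:

```python
import random
import re
from difflib import SequenceMatcher

def refine_corpus(paragraphs, min_len=200, max_len=1000, sample_ratio=0.2, seed=1):
    """Illustrative refinement only; thresholds and the similarity metric are assumptions."""
    kept = []
    for p in paragraphs:
        p = re.sub(r"\([^)]*\)", " ", p)           # drop content within parentheses
        p = re.sub(r"\s{2,}", " ", p).strip()
        if not (min_len <= len(p) <= max_len):     # keep medium-length paragraphs
            continue
        # Compare against the previously kept paragraph only, to keep the sketch cheap;
        # the actual comparison window used by the authors is not specified.
        if kept and SequenceMatcher(None, kept[-1], p).ratio() >= 0.5:
            continue                               # >= 50% similar, treated as redundant
        kept.append(p)
    random.seed(seed)
    return random.sample(kept, k=int(len(kept) * sample_ratio))  # keep 20% at random
```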

After preprocessing, the training corpus consisted of:

- **455 MB of text**
- **895,125 paragraphs**
- **64.6 million words**
- **1.13 million unique words**
- **119 unique characters**

## Pretraining

The `SK_Morph_BLM` model was trained with the following key parameters (a configuration sketch follows the list):
- **Architecture:** Based on RoBERTa, with 6 hidden layers and 12 attention heads.
- **Hidden size:** 576
- **Vocabulary size:** 50,264 tokens
- **Sequence length:** 256 tokens
- **Dropout:** 0.1
- **Number of parameters:** 58 million
- **Optimizer:** AdamW, learning rate 1×10^(-4), weight decay 0.01
- **Training:** 30 epochs, divided into 3 phases:
  - **Phase 1:** 10 epochs on CPU (4x AMD EPYC 7542), batch size 64, 50 hours per epoch, 139,870 steps total.
  - **Phase 2:** 5 epochs on GPU (1x Nvidia A100 40GB), batch size 64, 100 minutes per epoch, 69,935 steps total.
  - **Phase 3:** 15 epochs on GPU (2x Nvidia A100 40GB), batch size 128, 60 minutes per epoch, 104,910 steps total.
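
For reference, a configuration matching these hyperparameters could be instantiated as in the sketch below. The `intermediate_size` and position-embedding offset are assumptions (Transformers defaults); with them, the parameter count comes out close to the reported 58 million:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50264,
    hidden_size=576,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,        # assumed: the Transformers default
    max_position_embeddings=258,   # 256-token sequences plus RoBERTa's 2-position offset
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```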

The model was trained with the Hugging Face library, but without the `Trainer` class; the training loop was implemented in native PyTorch instead.
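
A single optimization step of such a native loop might look like the following sketch, reusing the `model` from the config sketch above. The real data pipeline (masking of OSCAR paragraphs, batching, the three-phase schedule) is replaced here by a random toy batch, and the masking scheme shown is an assumption:

```python
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
model.train()

# Toy batch standing in for the real data pipeline: random token IDs
# with ~15% of positions masked.
input_ids = torch.randint(5, 50264, (8, 256))
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15
input_ids[mask] = 4    # ID 4 is the <mask> token, as in the usage example above
labels[~mask] = -100   # compute the loss only on masked positions

outputs = model(input_ids=input_ids,
                attention_mask=torch.ones_like(input_ids),
                labels=labels)
outputs.loss.backward()  # one optimization step
optimizer.step()
optimizer.zero_grad()
```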

## Fine-Tuned Versions of the SK_Morph_BLM Model

The following fine-tuned versions of the `SK_Morph_BLM` model are available:

- [`SK_Morph_BLM-ner`](https://huggingface.co/daviddrzik/SK_Morph_BLM-ner): Fine-tuned for Named Entity Recognition (NER) tasks.
- [`SK_Morph_BLM-pos`](https://huggingface.co/daviddrzik/SK_Morph_BLM-pos): Fine-tuned for Part-of-Speech (POS) tagging.
- [`SK_Morph_BLM-qa`](https://huggingface.co/daviddrzik/SK_Morph_BLM-qa): Fine-tuned for Question Answering tasks.
- [`SK_Morph_BLM-sentiment-csfd`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-csfd): Fine-tuned for sentiment analysis on the CSFD (movie review) dataset.
- [`SK_Morph_BLM-sentiment-multidomain`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-multidomain): Fine-tuned for sentiment analysis across multiple domains.
- [`SK_Morph_BLM-sentiment-reviews`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-reviews): Fine-tuned for sentiment analysis on general review datasets.
- [`SK_Morph_BLM-topic-news`](https://huggingface.co/daviddrzik/SK_Morph_BLM-topic-news): Fine-tuned for topic classification in news articles.