ValentinB committed
Commit c2f17b5
Parent(s): 29a6924
Add model files
- README.md +19 -0
- config.json +25 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,22 @@
 ---
 license: mit
 ---
+This model was generated in the Enrich4All project.<br>
+Perplexity was evaluated on the MLM task after fine-tuning on a COVID-related corpus.<br>
+Baseline model: https://huggingface.co/racai/distilbert-base-romanian-cased <br>
+Scripts and corpus used for training: https://github.com/racai-ai/e4all-models
+
+Corpus
+---------------
+
+The COVID-19 datasets we designed are a small corpus and a question-answer dataset. The targeted sources were the official websites of Romanian institutions involved in managing the COVID-19 pandemic, such as the Ministry of Health, the Bucharest Public Health Directorate, the National Information Platform on Vaccination against COVID-19, and the Ministry of Foreign Affairs, as well as those of the European Union. We also harvested the website of a non-profit initiative that, in partnership with the Romanian Government through the Romanian Digitization Authority, developed an extensive platform with sections dedicated to official COVID-19 news and recommendations. News websites were avoided because of the volatile, continuously changing pandemic situation, but the website of a major private medical clinic (Regina Maria) was a reliable source of information, providing detailed medical articles on subjects of immediate interest to readers and patients, such as immunity, emerging treatment protocols, and the new Omicron variant of the virus.
+The corpus was manually collected and revised. The data were checked for grammatical correctness, and missing diacritics were restored.
+<br><br>
+The corpus consists of 55 UTF-8 documents and contains 147,297 words.
+
+Results
+-----------------
+| MLM Task          | Perplexity |
+| ----------------- | ---------- |
+| Baseline          | 68.39      |
+| COVID Fine-tuning | 5.56       |
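The perplexity values in the Results table can be related to an average per-token loss. The following is a minimal sketch, assuming perplexity was computed as the exponential of the mean cross-entropy (in nats) over masked tokens; the README does not state the formula, and the helper names `perplexity_from_loss` and `loss_from_perplexity` are hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean cross-entropy (nats) over masked tokens.

    Assumption: this is the usual convention, not confirmed by the README.
    """
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: the mean cross-entropy implied by a perplexity value."""
    return math.log(perplexity)


# Values reported in the Results table of the README.
baseline_ppl = 68.39
finetuned_ppl = 5.56

baseline_loss = loss_from_perplexity(baseline_ppl)
finetuned_loss = loss_from_perplexity(finetuned_ppl)

print(f"implied baseline loss:   {baseline_loss:.3f} nats")
print(f"implied fine-tuned loss: {finetuned_loss:.3f} nats")
print(f"perplexity ratio:        {baseline_ppl / finetuned_ppl:.1f}x")
```

Working in loss space makes the two rows easier to compare, since a difference in mean loss corresponds to a multiplicative change in perplexity.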
config.json
ADDED
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "racai/distilbert-base-romanian-cased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForMaskedLM"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_hidden_states": true,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": true,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.15.0",
+  "vocab_size": 50000
+}
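The config values give a rough idea of the model's size. The sketch below estimates the parameter count under stated assumptions: a standard DistilBERT layout (word and position embeddings with a layer norm, four attention projections and a two-layer feed-forward network per block, each followed by a layer norm) and a masked-LM head whose output projection shares weights with the word embeddings. The helper names are hypothetical, and the result is an estimate, not a figure taken from the commit.

```python
# Values copied from the config.json in this commit.
config = {
    "vocab_size": 50000,
    "dim": 768,
    "hidden_dim": 3072,
    "n_layers": 6,
    "max_position_embeddings": 512,
}


def linear(n_in: int, n_out: int) -> int:
    """Weight matrix plus bias vector of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a layer norm."""
    return 2 * dim


def estimate_parameters(cfg: dict) -> int:
    d, h, v = cfg["dim"], cfg["hidden_dim"], cfg["vocab_size"]
    # Word embeddings + position embeddings + embedding layer norm.
    embeddings = v * d + cfg["max_position_embeddings"] * d + layer_norm(d)
    # Query, key, value, and output projections.
    attention = 4 * linear(d, d)
    # Two-layer feed-forward network.
    ffn = linear(d, h) + linear(h, d)
    block = attention + layer_norm(d) + ffn + layer_norm(d)
    # MLM head: transform layer, layer norm, and output bias.
    # Assumption: projection weights are tied to the word embeddings,
    # so they are not counted a second time.
    mlm_head = linear(d, d) + layer_norm(d) + v
    return embeddings + cfg["n_layers"] * block + mlm_head


n_params = estimate_parameters(config)
approx_bytes = n_params * 4  # float32, per "torch_dtype" in the config

print(f"estimated parameters: {n_params:,}")
print(f"estimated float32 size: {approx_bytes:,} bytes")
```

The float32 estimate can be compared with the `size` field of the pytorch_model.bin pointer in this commit; a small gap is expected from serialization overhead.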
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9dec693ecd54125474b0b546dc0751443a98431015042a1172a3c06451950272
+size 327897007
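The three lines added for pytorch_model.bin are a Git LFS pointer, not the weights themselves: the repository stores the spec version, a SHA-256 object id, and the byte size of the real file. A minimal sketch of reading such a pointer and checking downloaded bytes against it follows; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the parser assumes the simple `key value` layout shown in this diff.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that raw bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer content copied from the pytorch_model.bin entry in this commit.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:9dec693ecd54125474b0b546dc0751443a98431015042a1172a3c06451950272
size 327897007
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size"])
```

For a file of this size, a real check would hash the download in chunks instead of loading it into memory at once; `matches_pointer` keeps the in-memory form only for brevity.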
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
+{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "racai/distilbert-base-romanian-cased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "DistilBertTokenizer"}
training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5a25d8378ef9ccf0bc0a8c9a84b7168fd2f2ec9f14b3935a712dc3d6af2373c5
+size 2927
vocab.txt
ADDED
The diff for this file is too large to render.