maltehb committed
Commit 4459c08
2 Parent(s): 6338e91 16abb88

Merge branch 'main' of https://huggingface.co/flax-community/roberta-base-danish into main

.gitignore ADDED
@@ -0,0 +1,134 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # VIm
+ **/.vim
+ **/.swp
+ **/.swo
README.md CHANGED
@@ -18,4 +18,4 @@ This is a sample reference model for Flax/Jax training using only on the MC4. It
 
 ---
 ## Description
- My description
+ My description
config.json CHANGED
@@ -24,4 +24,4 @@
 "type_vocab_size": 1,
 "use_cache": true,
 "vocab_size": 50265
- }
+ }
continue_run_mlm_flax_stream.sh CHANGED
@@ -23,4 +23,4 @@ python3 ./src/run_mlm_flax_stream.py \
 --eval_steps="1000" \
 #--push_to_hub \
 #--config_name="${MODEL_DIR}" \
- #--model_type="roberta" \
+ #--model_type="roberta" \
md_logs/train_tokenizer.md CHANGED
@@ -67,4 +67,4 @@ python3 train_tokenizer.py
 ### Problems while developing the script:
 - Loading the '*mc4*' dataset using the `load_dataset()` from HugginFace's dataset package `datasets` was not able to load multiple language in one line of code, as otherwise specified [here](https://huggingface.co/datasets/mc4). It was thus chosen to load each language and concatenate them.
 - Furthermore, it seems that even though you predefine a subset-split using the `split` argument, the entire dataset still needs to be downloaded.
- - Some bug occured when downloading the danish dataset, and we then had to force a redownload to mitigate the bug, and make the VM download it.
+ - Some bug occured when downloading the danish dataset, and we then had to force a redownload to mitigate the bug, and make the VM download it.
special_tokens_map.json CHANGED
@@ -1 +1,15 @@
- {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
+ {
+ "bos_token": "<s>",
+ "eos_token": "</s>",
+ "unk_token": "<unk>",
+ "sep_token": "</s>",
+ "pad_token": "<pad>",
+ "cls_token": "<s>",
+ "mask_token": {
+ "content": "<mask>",
+ "single_word": false,
+ "lstrip": true,
+ "rstrip": false,
+ "normalized": false
+ }
+ }
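One detail worth noting in the map above: `mask_token` is the only entry carrying per-token flags, and `lstrip: true` lets `<mask>` absorb the space before it, the usual setting for RoBERTa-style byte-level BPE. A small stdlib-only check of the structure (the JSON is copied from the diff above):

```python
import json

# special_tokens_map.json as added by this commit, verbatim.
raw = '''
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "sep_token": "</s>",
  "pad_token": "<pad>",
  "cls_token": "<s>",
  "mask_token": {
    "content": "<mask>",
    "single_word": false,
    "lstrip": true,
    "rstrip": false,
    "normalized": false
  }
}
'''
tokens = json.loads(raw)

# <s> doubles as BOS and CLS, </s> as EOS and SEP (standard RoBERTa layout).
assert tokens["bos_token"] == tokens["cls_token"] == "<s>"
assert tokens["eos_token"] == tokens["sep_token"] == "</s>"
# lstrip=True means "<mask>" eats the preceding space during tokenization.
assert tokens["mask_token"]["lstrip"] is True
```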
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
tokenizer_config.json CHANGED
@@ -1 +1,14 @@
- {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": false, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "name_or_path": "./", "tokenizer_class": "RobertaTokenizer"}
+ {
+ "unk_token": "<unk>",
+ "bos_token": "<s>",
+ "eos_token": "</s>",
+ "add_prefix_space": false,
+ "errors": "replace",
+ "sep_token": "</s>",
+ "cls_token": "<s>",
+ "pad_token": "<pad>",
+ "mask_token": "<mask>",
+ "special_tokens_map_file": null,
+ "name_or_path": "./",
+ "tokenizer_class": "RobertaTokenizer"
+ }
vocab.json CHANGED
The diff for this file is too large to render. See raw diff