Milan Straka committed
Commit 354716a
Parent: 8a75d1f

Release version 1.1 of RobeCzech.


In version 1.1, the tokenizer was modified by (a) removing the hole, and (b)
mapping all tokens to unique IDs. That also required increasing the vocabulary
size and the embedding weights (by replicating the embedding of the `[UNK]`
token). Without finetuning, versions 1.0 and 1.1 give exactly the same results
on any input, and every token that mapped in version 1.0 to an ID different
from that of the `[UNK]` token keeps the same ID in version 1.1.
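The remapping can be illustrated with a minimal sketch (the function and the toy vocabulary below are hypothetical illustrations, not the actual tokenizer code): tokens colliding with `[UNK]` receive fresh IDs, filling any former hole first and then extending the vocabulary, while tokens that already had unique IDs keep them.

```python
def repair_vocab(vocab, vocab_size, unk_token="[UNK]"):
    """Give every token a unique ID: tokens colliding with [UNK] are moved
    into unused IDs ("holes") first, then appended past the old vocabulary
    size. Tokens that already had a unique ID keep it, so existing
    encodings stay valid."""
    unk_id = vocab[unk_token]
    used = set(vocab.values())
    holes = [i for i in range(vocab_size) if i not in used]
    fixed, next_id = dict(vocab), vocab_size
    for token in sorted(vocab):
        if token != unk_token and vocab[token] == unk_id:
            if holes:
                fixed[token] = holes.pop(0)
            else:
                fixed[token] = next_id
                next_id += 1
    return fixed, next_id

# Toy vocabulary: ID 4 is a hole, and "x" and "y" collide with [UNK] (ID 3).
toy = {"[UNK]": 3, "a": 0, "b": 1, "c": 2, "x": 3, "y": 3, "d": 5}
fixed, new_size = repair_vocab(toy, vocab_size=6)
```

In this toy case, "x" fills the hole at ID 4, "y" gets the new ID 6, and the vocabulary grows from 6 to 7 entries; all other tokens keep their original IDs.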

However, the sizes of the embeddings (and of the LM head weights and biases)
differ, so the weights of version 1.1 are not compatible with the
configuration of version 1.0, and vice versa.
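The embedding growth described above can be sketched as follows (a minimal illustration using NumPy, not the actual conversion script): the matrix gains one row per newly assigned ID, each a copy of the `[UNK]` row, so the new IDs behave exactly as `[UNK]` did before.

```python
import numpy as np

def extend_embeddings(weights, new_vocab_size, unk_id):
    """Grow an embedding matrix to new_vocab_size rows by replicating the
    [UNK] row, so newly assigned token IDs produce the same embedding that
    [UNK] produced in the old model."""
    extra = new_vocab_size - weights.shape[0]
    unk_rows = np.repeat(weights[unk_id:unk_id + 1], extra, axis=0)
    return np.concatenate([weights, unk_rows], axis=0)

# Toy matrix: 5 tokens with embedding dimension 3, [UNK] at ID 3, grown to 7 rows.
w = np.arange(15, dtype=float).reshape(5, 3)
w2 = extend_embeddings(w, new_vocab_size=7, unk_id=3)
```

Because the original rows are untouched and the added rows duplicate `[UNK]`, the extended model computes identical outputs for any input the old model could encode.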

Files changed (7):
  1. README.md +32 -1
  2. config.json +1 -1
  3. model.safetensors +2 -2
  4. pytorch_model.bin +2 -2
  5. tf_model.h5 +2 -2
  6. tokenizer.json +0 -0
  7. vocab.json +0 -0
README.md CHANGED
@@ -11,7 +11,38 @@ tags:
 
 # Model Card for RobeCzech
 
-**If you are having issues with the tokenizer, please see https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314.**
+## Version History
+
+- **version 1.1**: Version 1.1 was released in Jan 2024 with a change to the
+  tokenizer; the weights are unmodified.
+
+  The tokenizer in the initial release (a) contained a hole (ID 51959 did not
+  correspond to any token), and (b) mapped several tokens (unseen during
+  training but required by the BBPE tokenizer) to the same ID as the `[UNK]`
+  token (3). That sometimes caused problems, as in
+  https://huggingface.co/ufal/robeczech-base/discussions/4. See
+  https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314
+  for more information.
+
+  In version 1.1, the tokenizer was modified by (a) removing the hole, and (b)
+  mapping all tokens to unique IDs. That also required increasing the
+  vocabulary size and the embedding weights (by replicating the embedding of
+  the `[UNK]` token). Without finetuning, versions 1.0 and 1.1 give exactly
+  the same results on any input, and every token that mapped in version 1.0
+  to an ID different from that of the `[UNK]` token keeps the same ID in
+  version 1.1.
+
+  However, the sizes of the embeddings (and of the LM head weights and biases)
+  differ, so the weights of version 1.1 are not compatible with the
+  configuration of version 1.0, and vice versa.
+
+- **version 1.0**: Initial version released in May 2021 (with the tokenization
+  issues described above).
+
+If you want to load the pretrained model, configuration, or tokenizer of
+version 1.0, you can use
+```python
+from_pretrained("ufal/robeczech-base", revision="v1.0")
+```
+to create an `AutoModel`, an `AutoConfig`, or an `AutoTokenizer`.
 
 # Model Details
 
config.json CHANGED
@@ -18,5 +18,5 @@
   "num_hidden_layers": 12,
   "pad_token_id": 1,
   "type_vocab_size": 1,
-  "vocab_size": 51961
+  "vocab_size": 51997
 }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8f1fe5a8d6de3910c79af9ef587453aafc3a098ba131521c23d092af6f65e8ee
-size 506605544
+oid sha256:1471bbf101e701ff889b87a5dd0dc87bd862e688b77f30fdb5c78f571750a1b9
+size 504141580
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6ada3c49cf56eda3228362a987ef813d4cec59bbb70730237d0d86bad6f0111c
-size 506663689
+oid sha256:d352e06e70edb7626a1dfd59ad2481f9c7d01996d6c8b137b1ad9c38ae6c5a37
+size 504184434
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a135a8b68b32fba27b5950b203787cfb5f06b714122ecc216b1b2a83808e27c0
-size 667860748
+oid sha256:70e3cc7e936ff540d2d8e345ddfe7c40662f40732941d7bfee6b1125ac0b5e7a
+size 665719492
tokenizer.json CHANGED
(diff too large to render)
vocab.json CHANGED
(diff too large to render)