Pretrained CANINE model on the English language using a masked language modeling (MLM) objective. It was introduced in the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) and first released in [this repository](https://github.com/google-research/language/tree/master/language/canine).

What's special about CANINE is that it doesn't require an explicit tokenizer (such as WordPiece or SentencePiece), unlike other models such as BERT and RoBERTa. Instead, it operates directly at the character level: each character is turned into its [Unicode code point](https://en.wikipedia.org/wiki/Code_point#:~:text=For%20Unicode%2C%20the%20particular%20sequence,forming%20a%20self%2Dsynchronizing%20code.).

This means that input processing is trivial and can typically be accomplished as `input_ids = [ord(char) for char in text]`, using Python's built-in `ord()` function.
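For example, here is a minimal, self-contained sketch of this character-level encoding in plain Python (the sample text is purely illustrative):

```python
# Encode text as Unicode code points, exactly as described above:
# no vocabulary or tokenizer needed, just Python's built-in ord().
text = "CANINE is tokenization-free."
input_ids = [ord(char) for char in text]
print(input_ids[:5])  # [67, 65, 78, 73, 78]

# The mapping is lossless: chr() inverts ord().
decoded = "".join(chr(i) for i in input_ids)
assert decoded == text
```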
Disclaimer: The team releasing CANINE did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description
CANINE is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the CANINE model as inputs.
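As a hedged sketch of that feature-extraction workflow, using the `CanineTokenizer` and `CanineModel` classes from the `transformers` library and, purely for illustration, the `google/canine-c` checkpoint name (not stated in this card):

```python
# A minimal sketch: extract CANINE features for a downstream classifier.
# The checkpoint name below is an assumption for illustration.
from transformers import CanineTokenizer, CanineModel

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
model = CanineModel.from_pretrained("google/canine-c")

sentences = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(sentences, padding="longest", truncation=True, return_tensors="pt")

outputs = model(**encoding)
sequence_output = outputs.last_hidden_state  # one vector per character
pooled_output = outputs.pooler_output        # one vector per sentence
```

The pooled (or per-character) outputs can then serve as input features for any standard classifier.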
## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=canine) to look for fine-tuned versions on a task that interests you.
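A hedged sketch of the fine-tuning path, assuming the `CanineForSequenceClassification` head class from `transformers` and, again for illustration only, the `google/canine-c` checkpoint name:

```python
# Load CANINE with a fresh sequence-classification head for fine-tuning.
# The head class and checkpoint name are assumptions, not from this card.
from transformers import CanineForSequenceClassification, CanineTokenizer

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
model = CanineForSequenceClassification.from_pretrained("google/canine-c", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding="longest", return_tensors="pt")
logits = model(**batch).logits  # shape (2, num_labels); the head is untrained
```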