nielsr (HF staff) committed
Commit 7886a04
1 Parent(s): c26df70

Update model card

Files changed (1):
  1. README.md +4 -2
README.md CHANGED
@@ -12,6 +12,10 @@ datasets:
 
 Pretrained CANINE model on English language using a masked language modeling (MLM) objective. It was introduced in the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) and first released in [this repository](https://github.com/google-research/language/tree/master/language/canine).
 
+What's special about CANINE is that it doesn't require an explicit tokenizer (such as WordPiece or SentencePiece), unlike other models such as BERT and RoBERTa. Instead, it operates directly at the character level: each character is turned into its [Unicode code point](https://en.wikipedia.org/wiki/Code_point#:~:text=For%20Unicode%2C%20the%20particular%20sequence,forming%20a%20self%2Dsynchronizing%20code.).
+
+This means that input processing is trivial and can typically be accomplished as `input_ids = [ord(char) for char in text]`, using Python's built-in `ord()` function.
+
 Disclaimer: The team releasing CANINE did not write a model card for this model so this model card has been written by the Hugging Face team.
 
 ## Model description
@@ -23,8 +27,6 @@ CANINE is a transformers model pretrained on a large corpus of English data in a
 
 This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the CANINE model as inputs.
 
-What's special about CANINE is that it doesn't require an explicit tokenizer (such as WordPiece or SentencePiece). Instead, it directly operates at a character level: each character is turned into its [Unicode code point](https://en.wikipedia.org/wiki/Code_point#:~:text=For%20Unicode%2C%20the%20particular%20sequence,forming%20a%20self%2Dsynchronizing%20code.).
-
 ## Intended uses & limitations
 
 You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=canine) to look for fine-tuned versions on a task that interests you.
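For reference, here is a minimal, runnable sketch of the tokenization-free encoding the added paragraph describes. It is plain Python using only the built-in `ord()` and `chr()` functions; the `text` value is an arbitrary example, not taken from the commit:

```python
# Tokenization-free encoding as described in the model card:
# each character maps directly to its Unicode code point.
text = "hello world"

# One input id per character; no WordPiece/SentencePiece vocabulary is involved.
input_ids = [ord(char) for char in text]
print(input_ids)  # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# The mapping is lossless, so decoding is simply the inverse.
assert "".join(chr(i) for i in input_ids) == text
```

In practice, the `CanineTokenizer` class in the `transformers` library wraps this same idea while additionally handling special tokens, truncation, and padding.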