CamemBERT
The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019, and was trained on 138GB of French text.
The abstract from the paper is the following:
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. Aiming to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.
Tips:
This implementation is the same as RoBERTa. Refer to the documentation of RoBERTa for usage examples as well as information regarding inputs and outputs.
The original code can be found here.
CamembertConfig

class transformers.CamembertConfig(pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)

This class overrides RobertaConfig. Please check the superclass for the appropriate documentation alongside usage examples.
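As a minimal usage sketch (not part of the original documentation), a configuration can be built directly and used to initialize a model with random weights:

    from transformers import CamembertConfig, CamembertModel

    # Defaults match CamemBERT's special-token IDs:
    # pad_token_id=1, bos_token_id=0, eos_token_id=2 (plus the RobertaConfig defaults)
    config = CamembertConfig()

    # Initializing a model from a config does NOT load pretrained weights
    model = CamembertModel(config)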
CamembertTokenizer

class transformers.CamembertTokenizer(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', additional_special_tokens=['<s>NOTUSED', '</s>NOTUSED'], **kwargs)

SentencePiece-based tokenizer, adapted from RobertaTokenizer and XLNetTokenizer. Peculiarities:

- requires SentencePiece

This tokenizer inherits from PreTrainedTokenizer, which contains most of the methods. Users should refer to the superclass for more information regarding methods.

Parameters:
- vocab_file (str) – Path to the vocabulary file.
- bos_token (string, optional, defaults to "<s>") – The beginning-of-sequence token that was used during pre-training. Can be used as a sequence classifier token.
  Note: when building a sequence using special tokens, this is not the token used for the beginning of the sequence; the token used is the cls_token.
- eos_token (string, optional, defaults to "</s>") – The end-of-sequence token.
  Note: when building a sequence using special tokens, this is not the token used for the end of the sequence; the token used is the sep_token.
- sep_token (string, optional, defaults to "</s>") – The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also the last token of a sequence built with special tokens.
- cls_token (string, optional, defaults to "<s>") – The classifier token, used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of a sequence built with special tokens.
- unk_token (string, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
- pad_token (string, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.
- mask_token (string, optional, defaults to "<mask>") – The token used for masking values. This is the token used when training this model with masked language modeling, and the token the model will try to predict.
- additional_special_tokens (List[str], optional, defaults to ["<s>NOTUSED", "</s>NOTUSED"]) – Additional special tokens used by the tokenizer.
sp_model

The SentencePiece processor that is used for every conversion (string, tokens and IDs).

Type: SentencePieceProcessor
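As a quick sketch (not from the original docs), loading the tokenizer from the pretrained camembert-base checkpoint and tokenizing a sentence with SentencePiece:

    from transformers import CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

    tokens = tokenizer.tokenize("J'aime le camembert !")
    # SentencePiece subword pieces; the exact split depends on the learned
    # vocabulary, e.g. ['▁J', "'", 'aime', '▁le', '▁ca', 'mem', 'bert', '▁!']
    ids = tokenizer.convert_tokens_to_ids(tokens)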
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CamemBERT sequence has the following format:

- single sequence: <s> X </s>
- pair of sequences: <s> A </s></s> B </s>

Parameters:
- token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

Returns: List of input IDs with the appropriate special tokens.

Return type: List[int]
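A short illustration (assuming the camembert-base checkpoint and placeholder token IDs) of where the special tokens land:

    from transformers import CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

    single = tokenizer.build_inputs_with_special_tokens([10, 11, 12])
    # [cls_id, 10, 11, 12, sep_id]                      -> <s> X </s>
    pair = tokenizer.build_inputs_with_special_tokens([10, 11], [20, 21])
    # [cls_id, 10, 11, sep_id, sep_id, 20, 21, sep_id]  -> <s> A </s></s> B </s>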
create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. CamemBERT, like RoBERTa, does not make use of token type IDs, therefore a list of zeros is returned.

Parameters:
- token_ids_0 (List[int]) – List of IDs.
- token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

Returns: List of zeros.

Return type: List[int]
get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int]

Retrieves sequence IDs from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model or encode_plus methods.

Parameters:
- token_ids_0 (List[int]) – List of IDs.
- token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False) – Set to True if the token list is already formatted with special tokens for the model.

Returns: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type: List[int]
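For illustration (not part of the original docs; placeholder IDs), note that the returned mask describes the sequence as it would look after the special tokens are added:

    from transformers import CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

    mask = tokenizer.get_special_tokens_mask([10, 11, 12])
    # [1, 0, 0, 0, 1]: 1 marks the <s>/</s> positions, 0 marks sequence tokens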
CamembertModel

class transformers.CamembertModel(config)

The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides RobertaModel. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
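A minimal forward-pass sketch (not from the original docs), assuming the camembert-base checkpoint:

    import torch
    from transformers import CamembertTokenizer, CamembertModel

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertModel.from_pretrained("camembert-base")

    # encode() adds the <s> ... </s> special tokens automatically
    input_ids = torch.tensor([tokenizer.encode("J'aime le camembert !")])
    outputs = model(input_ids)
    last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)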
CamembertForMaskedLM

class transformers.CamembertForMaskedLM(config)

CamemBERT Model with a language modeling head on top.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides RobertaForMaskedLM. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
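A minimal fill-mask sketch (not from the original docs), again assuming camembert-base:

    import torch
    from transformers import CamembertTokenizer, CamembertForMaskedLM

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertForMaskedLM.from_pretrained("camembert-base")

    input_ids = torch.tensor([tokenizer.encode("Le camembert est <mask> !")])
    logits = model(input_ids)[0]  # (batch_size, sequence_length, vocab_size)

    # Predict the most likely token at the masked position
    mask_index = (input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    predicted_id = logits[0, mask_index].argmax(-1).item()
    print(tokenizer.convert_ids_to_tokens([predicted_id]))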
CamembertForSequenceClassification

class transformers.CamembertForSequenceClassification(config)

CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides RobertaForSequenceClassification. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
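A minimal sketch (not from the original docs) of computing a classification loss; num_labels and the label value are placeholders, and the classification head is randomly initialized until fine-tuned:

    import torch
    from transformers import CamembertTokenizer, CamembertForSequenceClassification

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=2)

    input_ids = torch.tensor([tokenizer.encode("J'aime le camembert !")])
    labels = torch.tensor([1])  # hypothetical gold label

    outputs = model(input_ids, labels=labels)
    loss, logits = outputs[0], outputs[1]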
CamembertForMultipleChoice

class transformers.CamembertForMultipleChoice(config)

CamemBERT Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax), e.g. for RocStories/SWAG tasks.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides RobertaForMultipleChoice. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
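A rough sketch (not from the original docs) of shaping inputs for multiple choice; the prompt and choices are made up, and the choice head is randomly initialized until fine-tuned:

    import torch
    from transformers import CamembertTokenizer, CamembertForMultipleChoice

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertForMultipleChoice.from_pretrained("camembert-base")

    prompt = "J'aime le"
    choices = ["camembert", "football"]

    # Each (prompt, choice) pair is encoded separately, padded to a common
    # length, then stacked into (batch_size, num_choices, sequence_length)
    encoded = [tokenizer.encode(prompt, c) for c in choices]
    max_len = max(len(e) for e in encoded)
    padded = [e + [tokenizer.pad_token_id] * (max_len - len(e)) for e in encoded]
    input_ids = torch.tensor(padded).unsqueeze(0)  # batch size of 1

    logits = model(input_ids)[0]  # (batch_size, num_choices)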
CamembertForTokenClassification

class transformers.CamembertForTokenClassification(config)

CamemBERT Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides RobertaForTokenClassification. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
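A minimal NER-style sketch (not from the original docs); num_labels is a placeholder and the token classification head is randomly initialized until fine-tuned:

    import torch
    from transformers import CamembertTokenizer, CamembertForTokenClassification

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertForTokenClassification.from_pretrained("camembert-base", num_labels=5)

    input_ids = torch.tensor([tokenizer.encode("Louis Martin habite à Paris.")])
    logits = model(input_ids)[0]     # (batch_size, sequence_length, num_labels)
    predictions = logits.argmax(-1)  # one tag index per (sub)token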
TFCamembertModel

class transformers.TFCamembertModel(*args, **kwargs)

The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.

Note

TF 2.0 models accept two formats as inputs:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated with the input names given in the docstring: model({'input_ids': input_ids, 'token_type_ids': token_type_ids})

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides TFRobertaModel. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
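A sketch (not from the original docs) of the three input formats described in the note above; camembert-base is assumed to provide TF weights (pass from_pt=True to from_pretrained() otherwise):

    import tensorflow as tf
    from transformers import CamembertTokenizer, TFCamembertModel

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = TFCamembertModel.from_pretrained("camembert-base")

    input_ids = tf.constant([tokenizer.encode("J'aime le camembert !")])
    attention_mask = tf.ones_like(input_ids)

    out1 = model(input_ids)                              # single tensor
    out2 = model([input_ids, attention_mask])            # list, in docstring order
    out3 = model({"input_ids": input_ids,
                  "attention_mask": attention_mask})     # dict keyed by input names

    last_hidden_state = out1[0]  # (batch_size, sequence_length, hidden_size)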
TFCamembertForMaskedLM

class transformers.TFCamembertForMaskedLM(*args, **kwargs)

CamemBERT Model with a language modeling head on top.

Note

TF 2.0 models accept two formats as inputs:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated with the input names given in the docstring: model({'input_ids': input_ids, 'token_type_ids': token_type_ids})

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides TFRobertaForMaskedLM. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
TFCamembertForSequenceClassification

class transformers.TFCamembertForSequenceClassification(*args, **kwargs)

CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.

Note

TF 2.0 models accept two formats as inputs:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated with the input names given in the docstring: model({'input_ids': input_ids, 'token_type_ids': token_type_ids})

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides TFRobertaForSequenceClassification. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.
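A rough fine-tuning sketch (not from the original docs) using tf.keras.Model.fit() with the dict input format; the toy batch, labels, and hyperparameters are placeholders:

    import tensorflow as tf
    from transformers import CamembertTokenizer, TFCamembertForSequenceClassification

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = TFCamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=2)

    # Toy batch: two sentences padded to a common length, with binary labels
    batch = [tokenizer.encode("J'aime le camembert !"),
             tokenizer.encode("Je déteste la pluie.")]
    max_len = max(len(x) for x in batch)
    input_ids = tf.constant([x + [tokenizer.pad_token_id] * (max_len - len(x)) for x in batch])
    labels = tf.constant([1, 0])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    # fit() passes all tensors through the first argument, hence the dict format
    model.fit({"input_ids": input_ids}, labels, epochs=1)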
TFCamembertForTokenClassification

class transformers.TFCamembertForTokenClassification(*args, **kwargs)

CamemBERT Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks.

Note

TF 2.0 models accept two formats as inputs:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated with the input names given in the docstring: model({'input_ids': input_ids, 'token_type_ids': token_type_ids})

Parameters:
- config (CamembertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class overrides TFRobertaForTokenClassification. Please check the superclass for the appropriate documentation alongside usage examples.

config_class

Alias of transformers.configuration_camembert.CamembertConfig.