
Model Overview

This model is a fine-tuned version of cmarkea/distilcamembert-base-ner, adapted for Named Entity Recognition (NER) on French datasets. Its backbone, DistilCamemBERT, is a lighter, distilled variant of CamemBERT, and the base checkpoint is already optimized for NER over entities such as locations, organizations, persons, and miscellaneous entities in French text.

Model Type

  • Architecture: CamembertForTokenClassification
  • Base Model: DistilCamemBERT
  • Hidden Layers: 6
  • Attention Heads: 12
  • Parameters: ~67.5M (float32)
  • Tokenizer: Based on CamemBERT's tokenizer
  • Vocab Size: 32,005 tokens
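
These figures can be checked directly against the checkpoint's configuration, as in this quick sketch:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
print(config.num_hidden_layers)    # 6
print(config.num_attention_heads)  # 12
print(config.vocab_size)           # 32005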

Intended Use

This model is fine-tuned for Named Entity Recognition (NER). The base checkpoint it builds on identifies and classifies standard entity types such as:

  • LOC (Location)
  • PER (Person)
  • ORG (Organization)
  • MISC (Miscellaneous)

On top of this, the fine-tuned model identifies the starting city and the ending city of a travel query (see the label set below).

Example Use Case:

Given a sentence such as "Je veux aller de Paris à Lyon", the model will detect and label:

  • Paris as the start location (B-START)
  • Lyon as the end location (B-END)
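
For end-to-end use, the transformers token-classification pipeline handles tokenization, prediction, and entity grouping in one call. A minimal sketch, assuming the label set documented below (aggregation merges B-/I- subword tags into whole entities):

from transformers import pipeline

# "simple" aggregation groups B-START/I-START tokens into one START entity, etc.
ner = pipeline(
    "token-classification",
    model="Crysy-rthomas/T-AIA-CamemBERT-NER-V2",
    aggregation_strategy="simple",
)

for entity in ner("Je veux aller de Paris à Lyon"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# Illustrative output shape: START Paris ... / END Lyon ...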

Limitations:

  • Language: The model is primarily designed for French texts.
  • Performance: Performance may degrade if used for non-French text or tasks outside NER.

Labels and Tokens

The model uses the following entity labels:

  • O: Outside any named entity
  • B-START: Beginning of a start-location entity (departure city)
  • I-START: Inside a start-location entity
  • B-END: Beginning of an end-location entity (arrival city)
  • I-END: Inside an end-location entity
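
For illustration, the example query above would be tagged word by word as follows (a word-level sketch; in practice the tokenizer splits words into subwords and each subword token receives a label):

Je      O
veux    O
aller   O
de      O
Paris   B-START
à       O
Lyon    B-END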

Training Data

The model was fine-tuned using a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon" to simulate common transportation-related interactions. The dataset contains named entity labels for city and station names.

Hyperparameters and Fine-Tuning:

  • Learning Rate: 2e-5
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Strategy: Epoch-based
  • Optimizer: AdamW
  • Early Stopping: Used to prevent overfitting
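
This setup maps directly onto the transformers Trainer API. A minimal sketch, assuming tokenized, label-aligned train_dataset and eval_dataset (these names and the early-stopping patience are assumptions, not published settings; recent transformers versions rename evaluation_strategy to eval_strategy):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="ner-travel",        # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",    # evaluate after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,    # needed for early stopping
)

trainer = Trainer(
    model=model,                    # the token-classification model (see How to Use)
    args=args,
    train_dataset=train_dataset,    # assumed: tokenized dataset with aligned labels
    eval_dataset=eval_dataset,      # assumed
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience assumed
)
trainer.train()                     # Trainer uses AdamW by default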

Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, reused unchanged for the entity-labeling task. It performs SentencePiece subword tokenization (an extension of Byte-Pair Encoding, BPE), splitting words into smaller units so that rare and unseen words can still be represented.

Tokenizer special settings:

  • Max Length: 128
  • Padding: Right-padded to 128 tokens
  • Truncation: Longest-first strategy, truncating tokens beyond 128
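
In code, these settings correspond to a call like the following (a sketch; the exact preprocessing used during training is not published):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

# Right-pad and truncate every input to a fixed length of 128 tokens
encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # pad on the right up to max_length
    truncation="longest_first",  # drop tokens beyond 128
    return_tensors="pt",
)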

How to Use

You can load and use this model with Hugging Face’s transformers library as follows:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

# Tokenize an example query and run a forward pass
text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")
outputs = model(**tokens)
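
The forward pass returns per-token logits. To turn them into labels, take the argmax over the label dimension and map ids to names via the model config; a minimal continuation of the snippet above:

# Highest-scoring label id per token, mapped to its label name
predicted_ids = outputs.logits.argmax(dim=-1)[0]
labels = [model.config.id2label[i] for i in predicted_ids.tolist()]
subwords = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
print(list(zip(subwords, labels)))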

Limitations and Bias

  • The model may not generalize well beyond French texts.
  • Results may be biased towards specific named entities frequently seen in the training data (such as city names).

License

This model is released under the Apache 2.0 License.
