---
license: mit
language:
- fr
base_model:
- cmarkea/distilcamembert-base
datasets:
- Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET
---

## Model Overview

This model is a fine-tuned version of **[cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base)**, adapted for **binary text classification** in French.

### Model Type

- **Architecture**: `CamembertForSequenceClassification`
- **Base Model**: DistilCamemBERT
- **Number of Layers**: 6 hidden layers, 12 attention heads
- **Tokenizer**: based on CamemBERT's tokenizer
- **Vocab Size**: 32,005 tokens

## Intended Use

This model is designed to classify sentences as either **travel-related** or **non-travel-related**, with high accuracy on French datasets.

### Example Use Case

Given a sentence such as "Je veux aller de Paris à Lyon" ("I want to go from Paris to Lyon"), the model returns:

- `label`: `POSITIVE`
- `score`: `0.9999655485153198`

Given a sentence such as "Je veux acheter du pain" ("I want to buy bread"), the model returns:

- `label`: `NEGATIVE`
- `score`: `0.9999724626541138`

### Limitations

- **Language**: optimized for French text; performance on other languages is not guaranteed.
- **Task scope**: specifically trained for binary classification; performance may degrade on multi-class or unrelated tasks.

## Labels

The model uses the following class labels:

- `POSITIVE`: travel-related sentences
- `NEGATIVE`: non-travel-related sentences

## Training Data

The model was fine-tuned on a proprietary French dataset, [Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET](https://huggingface.co/datasets/Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET), which contains thousands of labeled examples of travel and non-travel sentences.

## Hyperparameters and Fine-Tuning

- **Learning Rate**: 5e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Strategy**: epoch-based
- **Optimizer**: AdamW

A sketch of how these hyperparameters map onto a training script appears at the end of this card.

## Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, configured for this classification task. It uses subword tokenization based on BPE (Byte-Pair Encoding), which splits words into smaller units.

Tokenizer settings (applied in the code sketch at the end of this card):

- **Max Length**: 128
- **Padding**: right-padded to 128 tokens
- **Truncation**: longest-first strategy, truncating tokens beyond 128

## How to Use

You can load this model with Hugging Face’s `transformers` library and use the `pipeline` function to create a **text classification pipeline** as follows:

```python
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_path = "InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

# The pipeline returns a list of {label, score} dictionaries
sentence = "Je veux aller de Paris à Lyon"
result = classifier(sentence)
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9999...}]
```

## Limitations and Bias

While the model performs well on the training and test datasets, there are some known limitations:

- **Bias in Dataset**: performance may reflect the biases in the training data.
- **Generalization**: results may be biased towards specific named entities that appear frequently in the training data (such as city names).

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).
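
## Example: Applying the Tokenizer Settings

The snippet below is a minimal sketch of how the tokenizer settings listed in the Tokenizer section (max length 128, right padding, longest-first truncation) can be applied when calling the model directly, outside the pipeline. It uses only standard `transformers` APIs; the model path is the one from the How to Use section.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Reproduce the settings from the Tokenizer section:
# right-pad to 128 tokens, truncate longest-first beyond 128
inputs = tokenizer(
    "Je veux aller de Paris à Lyon",
    padding="max_length",        # pad to max_length (CamemBERT pads on the right)
    truncation="longest_first",
    max_length=128,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

# Map the winning logit back to the POSITIVE/NEGATIVE label
predicted = model.config.id2label[logits.argmax(dim=-1).item()]
print(predicted)
```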
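
## Example: Fine-Tuning Sketch

For reference, here is a hedged sketch of how the hyperparameters listed in the Hyperparameters and Fine-Tuning section could map onto the `transformers` `Trainer` API. This is illustrative, not the exact training script: the dataset column names (`text`, `label`), the presence of a `validation` split, and the `output_dir` name are assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "cmarkea/distilcamembert-base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Assumed schema: a "text" column and an integer "label" column;
# padding/truncation mirror the settings in the Tokenizer section
dataset = load_dataset("Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET")
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], padding="max_length", truncation=True, max_length=128
    ),
    batched=True,
)

# Hyperparameters from the section above; Trainer uses AdamW by default
args = TrainingArguments(
    output_dir="distilcamembert-travel-classifier",  # hypothetical name
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # named "evaluation_strategy" in transformers < 4.41
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumes a validation split exists
    tokenizer=tokenizer,
)
trainer.train()
```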