---
tags:
- generated_from_trainer
model-index:
- name: EUBERT
  results: []
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
widget:
 - text: "The transition to a climate neutral, sustainable, energy and resource-efficient, circular and fair economy is key to ensuring the long-term competitiveness of the economy of the Union and the well-being of its peoples. In 2016, the Union concluded the Paris Agreement. Article 2(1), point (c), of the Paris Agreement sets out the objective of strengthening the response to climate change by, among other means, making finance flows consistent with a pathway towards low greenhouse gas [MASK] and climate resilient development."
---



## Model Card: EUBERT

### Overview

- **Model Name**: EUBERT
- **Model Version**: 1.1
- **Date of Release**: 16 October 2023
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
- **Training Data**: Documents registered by the European Publications Office
- **Model Use Case**: Text Classification, Question Answering, Language Understanding

![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png)


### Model Description

EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the [European Publications Office](https://op.europa.eu/).
These documents span the last 30 years and cover a wide range of topics and domains.
EUBERT is designed as a general-purpose language model that can be fine-tuned for a variety of natural language processing tasks.
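Because EUBERT is a BERT-style masked language model, it can be queried directly for masked-token predictions before any fine-tuning. The following is a minimal sketch, not an official usage recipe: the checkpoint ID `EuropeanParliament/EUBERT` is an assumption inferred from the image URL in this card, and `check_single_mask` and `predict_mask` are hypothetical helper names.

```python
from typing import Dict, List

# Assumption: checkpoint ID inferred from the image URL in this card.
MODEL_ID = "EuropeanParliament/EUBERT"
MASK_TOKEN = "[MASK]"  # standard BERT mask token, as used in the widget example


def check_single_mask(text: str) -> str:
    """Return the text unchanged if it contains exactly one mask token."""
    count = text.count(MASK_TOKEN)
    if count != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN}, found {count}")
    return text


def predict_mask(text: str, top_k: int = 5) -> List[Dict]:
    """Return the top_k candidate fillers for the masked position."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return fill_mask(check_single_mask(text), top_k=top_k)
```

Calling `predict_mask("... low greenhouse gas [MASK] and climate resilient development.")` downloads the checkpoint on first use and returns a list of dictionaries, each with a candidate token and its score.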

### Intended Use

EUBERT serves as a starting point for building more specific natural language understanding models.
Its versatility makes it suitable for a wide range of tasks, including but not limited to:

1. **Text Classification**: EUBERT can be fine-tuned to classify documents into categories, for applications such as sentiment analysis, topic categorization, and spam detection.

2. **Question Answering**: Fine-tuned on question-answering datasets, EUBERT can extract answer spans from text documents, which supports tasks such as information retrieval.

3. **Language Understanding**: EUBERT can be fine-tuned for token-level tasks such as named entity recognition and part-of-speech tagging. As an encoder-only masked language model, it predicts masked tokens and is not designed for open-ended text generation.
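As an illustration of the text classification case, the sketch below prepares EUBERT for fine-tuning with a fresh classification head. It is a starting point under stated assumptions rather than a tested recipe: the checkpoint ID is inferred from the image URL in this card, the label names are hypothetical, and `build_label_maps` and `load_classifier` are illustrative helper names.

```python
from typing import Dict, List, Tuple

# Assumption: checkpoint ID inferred from the image URL in this card.
MODEL_ID = "EuropeanParliament/EUBERT"


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build the id-to-label and label-to-id mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(labels: List[str]):
    """Load the tokenizer and EUBERT with a randomly initialised classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head is untrained after loading, so the model needs fine-tuning on labelled data before its predictions are meaningful.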

### Performance

The specific performance metrics of EUBERT may vary depending on the downstream task and the quality and quantity of training data used for fine-tuning.
Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly.

### Considerations

- **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.

- **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.

- **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate bias when fine-tuning EUBERT for specific tasks.

### Conclusion

EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for developing natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding tasks. Users are encouraged to exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.


--- 

## Training procedure

EUBERT uses a dedicated WordPiece tokenizer with a vocabulary size of 2**16 (65,536 tokens).
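To illustrate how a WordPiece vocabulary is applied at inference time, here is a minimal sketch of the greedy longest-match-first splitting commonly used by BERT tokenizers. The toy vocabulary is hypothetical and far smaller than the real one, and this is a simplified illustration rather than EUBERT's actual tokenizer code.

```python
from typing import List, Set

VOCAB_SIZE = 2 ** 16  # 65,536 entries, the vocabulary size stated in this card


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one lowercased word into pieces by greedy longest-match-first search."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"regul", "##ation", "##s", "union"}
print(wordpiece_tokenize("regulations", toy_vocab))
```

With this toy vocabulary, `regulations` splits into `regul`, `##ation`, and `##s`, while a word with no matching pieces maps to `[UNK]`.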

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.85
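
The `linear` scheduler decays the learning rate from its initial value to zero over the course of training. The sketch below reproduces that shape in plain Python. The number of warmup steps is not listed above, so it defaults to zero here as an assumption, and the total step count in the usage example is hypothetical.

```python
BASE_LR = 5e-05  # learning_rate listed above


def linear_lr(step: int, total_steps: int, warmup_steps: int = 0,
              base_lr: float = BASE_LR) -> float:
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup (assumed absent by default).
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps, clamped at 0 afterwards.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

Halfway through such a run the learning rate is half of its initial value, and it reaches zero at the final step.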

### Training results

Coming soon 

### Framework versions

- Transformers 4.33.3
- Pytorch 2.0.1+cu117
- Datasets 2.14.5
- Tokenizers 0.13.3

### Infrastructure 

- **Hardware Type:** 4 GPUs with 24 GB of memory each
- **GPU Days:** 16
- **Cloud Provider:** EuroHPC
- **Compute Region:** Meluxina


## Author(s)

Sébastien Campion <sebastien.campion@europarl.europa.eu>