|
--- |
|
language: |
|
- pt |
|
tags: |
|
- GlórIA |
|
- European Portuguese |
|
- gptneo |
|
- decoder |
|
- foundation model |
|
- text-generation |
|
datasets: |
|
- NOVA-vision-language/calame-pt |
|
- europarl_bilingual |
|
- assin2 |
|
- dlb/plue |
|
- oscar-corpus/OSCAR-2301 |
|
- PORTULAN/glue-ptpt |
|
widget: |
|
- text: A culinária portuguesa é rica em aromas e |
|
- text: Os computadores hoje em dia são muito |
|
- text: A literatura Portuguesa é |
|
inference: |
|
parameters: |
|
temperature: 1 |
|
repetition_penalty: 2 |
|
max_new_tokens: 30 |
|
num_beams: 4 |
|
do_sample: true |
|
top_k: 50 |
|
library_name: transformers |
|
--- |
|
|
|
# GlórIA 1.3B |
|
|
|
|
|
<p align="left"><img src="https://github.com/rvlopes/GlorIA/blob/main/gloria-logo.png?raw=true" width="30%"></p> |
|
|
|
## Model Description |
|
**GlórIA** is a large generative language model, with a special **focus on European Portuguese**.
|
|
|
It is a 1.3B-parameter model based on [GPTNeo](https://huggingface.co/EleutherAI/gpt-neo-1.3B), with 24 layers and a hidden size of 2048.
|
|
|
You can check our [paper](https://aclanthology.org/2024.propor-1.45/), accepted at PROPOR 2024.
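As a sanity check, the 1.3B figure follows from the architecture above. Here is a back-of-the-envelope estimate in plain Python (the vocabulary size and context length are assumptions carried over from the GPT-Neo 1.3B base model, not stated in this card):

```py
# Rough parameter count for a GPT-Neo-style decoder with GlórIA's shape.
hidden = 2048     # hidden size (from the model description)
layers = 24       # number of layers (from the model description)
vocab = 50257     # GPT-Neo tokenizer vocabulary (assumption)
max_pos = 2048    # GPT-Neo context length (assumption)

embeddings = vocab * hidden + max_pos * hidden   # token + position embeddings
attention = 4 * hidden * hidden                  # q, k, v and output projections
mlp = 2 * hidden * (4 * hidden)                  # up- and down-projection
per_layer = attention + mlp

total = embeddings + layers * per_layer
print(f"{total / 1e9:.2f}B parameters")          # prints 1.32B parameters
```

Biases and layer-norm weights are omitted, so this slightly undercounts, but it lands on the advertised ~1.3B.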
|
|
|
## Training Data |
|
**GlórIA 1.3B** was trained on a large corpus of approximately 35B tokens, built by gathering multiple Portuguese sources:
|
- ArquivoPT News PT-PT Dataset: A collection of 1.4M European Portuguese archived news and periodicals from [Arquivo.pt](https://arquivo.pt/).
|
- [ClueWeb-Large PT-PT](https://lemurproject.org/clueweb22.php/): Multilingual Corpus, similar to OSCAR. Metadata was used to filter only PT-PT webpages. |
|
- [Europarl PT-PT](https://www.statmt.org/europarl/): A parallel corpus with documents such as transcripts from the European Parliament (we only used the PT-PT documents). |
|
- [OpenSubtitles PT-PT](https://opus.nlpl.eu/OpenSubtitles.php): A corpus containing PT-PT subtitles from [OpenSubtitles](http://www.opensubtitles.org/). |
|
- [OSCAR PT-PT](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201): Multilingual Corpus obtained from filtering the Common Crawl corpus. We used metadata to filter only PT-PT webpages. |
|
- PT WIKI: The Portuguese Wikipedia (dump of 2022/06/20).
|
|
|
<br> |
|
|
|
## Evaluation - CALAME-PT |
|
GlórIA 1.3B's generative capabilities were evaluated on **CALAME-PT**, a new Portuguese benchmark whose goal is to predict the last word of a sentence given its context.
|
|
|
| Model and Size     | Exact-Match |
| ------------------ | ----------- |
| Gervasio-PTPT (1B) | 44.01       |
| mGPT (1.3B)        | 47.14       |
| GlórIA (1.3B)      | 52.79       |
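CALAME-PT scores a model on whether the predicted last word exactly matches the reference. A minimal sketch of such an exact-match metric (the whitespace/punctuation normalization here is an assumption for illustration; see the CALAME-PT dataset for the official protocol):

```py
def exact_match(predictions, references):
    """Percentage of continuations whose first word equals the reference word."""
    hits = 0
    for pred, ref in zip(predictions, references):
        # Take the first whitespace-separated token of the generated continuation,
        # stripped of surrounding punctuation (normalization is an assumption).
        pred = pred.strip()
        word = pred.split()[0].strip(".,;:!?\"'") if pred else ""
        hits += word.lower() == ref.lower()
    return 100 * hits / len(references)

# Two toy examples: one hit ("aromas" matches), one miss.
print(exact_match(["aromas.", "lenta"], ["aromas", "rápida"]))  # prints 50.0
```

In practice the predictions would come from greedy or sampled generation truncated after a few new tokens.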
|
|
|
|
|
<br> |
|
|
|
# How to use |
|
## Basic Inference Example |
|
```py
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model='NOVA-vision-language/GlorIA-1.3B')
>>> generator("A culinária portuguesa é rica em aromas e", do_sample=True, min_length=50)
[{'generated_text': 'A culinária portuguesa é rica em aromas e'}]
```
|
## Recommended Parameters and Usage (for more flexibility) |
|
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextGenerationPipeline

# Load the model and tokenizer explicitly
model = AutoModelForCausalLM.from_pretrained("NOVA-vision-language/GlorIA-1.3B")
tokenizer = AutoTokenizer.from_pretrained("NOVA-vision-language/GlorIA-1.3B")

generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id,
    no_repeat_ngram_size=0, num_beams=4, repetition_penalty=2.0, temperature=1.0,
    output_scores=True, early_stopping=True
)

generator = TextGenerationPipeline(model=model, task="text-generation",
                                   tokenizer=tokenizer, device=0)
completion_prompts = ["Fernando Pessoa foi um dos poetas mais relevantes de"]
out = generator(completion_prompts, generation_config=generation_config)
# [[{'generated_text': 'Fernando Pessoa foi um dos poetas mais relevantes de toda a literatura portuguesa, autor de uma obra que se estende por mais de quatro dezenas de livros, entre os quais "Mensagem", "O Guardador de Rebanhos", "Livro do desassossego", "Odes",'}]]
```
|
|
|
<br> |
|
|
|
|
|
# Citation |
|
|
|
|
|
Please use the following BibTeX to cite our paper: |
|
```
@inproceedings{lopes-etal-2024-gloria,
    title = "{G}l{\'o}r{IA}: A Generative and Open Large Language Model for {P}ortuguese",
    author = "Lopes, Ricardo and
      Magalhaes, Joao and
      Semedo, David",
    editor = "Gamallo, Pablo and
      Claro, Daniela and
      Teixeira, Ant{\'o}nio and
      Real, Livy and
      Garcia, Marcos and
      Oliveira, Hugo Gon{\c{c}}alo and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.45",
    pages = "441--453",
}
```
|
|
|
**License**: GlórIA's usage is restricted to research-only purposes, subject to the ClueWeb22 Dataset license, which can be freely obtained [here](https://www.lemurproject.org/clueweb22/obtain.php). |
|
|
|
|
|
# Acknowledgements |
|
|
|
We would like to thank Arquivo.pt's team for their content preservation efforts, and for all the help and guidance in accessing the archived web pages at scale. |
|
This work was partially funded by the FCT project NOVA LINCS Ref. UIDP/04516/2020, by CMU|Portugal project iFetch, Ref. CMUP LISBOA-01-0247-FEDER-045920, and by the FCT project Ref. Nº CPCA-IAC/AV/594875/2023. |
|
|
|
<br> |
|
|
|
|
|
|
|
|
|
|
|
|