---
language:
- pt
tags:
- GlórIA
- European Portuguese
- gptneo
- decoder
- foundation model
- text-generation
datasets:
- europarl_bilingual
- assin2
- dlb/plue
- oscar-corpus/OSCAR-2301
- PORTULAN/glue-ptpt
widget:
- text: A culinária portuguesa é rica em aromas e
- text: Os computadores hoje em dia são muito
- text: A literatura Portuguesa é
inference:
  parameters:
    temperature: 1
    repetition_penalty: 2
    max_new_tokens: 30
    num_beams: 4
    do_sample: true
    top_k: 50
library_name: transformers
---

# GlórIA 1.3B

## Model Description
**GlórIA** is a large generative language model with a special **focus on European Portuguese**.

It is a 1.3B-parameter model based on [GPTNeo](https://huggingface.co/EleutherAI/gpt-neo-1.3B), with 24 layers and a hidden size of 2048.
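These dimensions can be checked directly from the model configuration; the snippet below is a minimal sketch that assumes the repo id used in the usage examples further down:

```py
from transformers import AutoConfig

# Fetch only the configuration (no model weights) and inspect the architecture.
config = AutoConfig.from_pretrained('NOVA-vision-language/GlorIA-1.3B-original')
print(config.num_layers)   # 24 transformer layers
print(config.hidden_size)  # hidden size of 2048
```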

## Training Data
**GlórIA 1.3B** was trained on a large corpus of approximately 35B tokens, built by gathering multiple Portuguese sources:
- [ArquivoPT News PT-PT Dataset](): a collection of 1.4M European Portuguese archived news articles and periodicals from [Arquivo.pt](https://arquivo.pt/).
- [ClueWeb-Large PT-PT](https://lemurproject.org/clueweb22.php/): a multilingual corpus, similar to OSCAR, whose metadata we used to keep only PT-PT webpages.
- [Europarl PT-PT](https://www.statmt.org/europarl/): a parallel corpus of European Parliament documents, such as session transcripts (we used only the PT-PT documents).
- [OpenSubtitles PT-PT](https://opus.nlpl.eu/OpenSubtitles.php): a corpus of PT-PT subtitles from [OpenSubtitles](http://www.opensubtitles.org/).
- [OSCAR PT-PT](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201): a multilingual corpus obtained by filtering Common Crawl; we used its metadata to keep only PT-PT webpages (see the sketch after this list).
- [PT WIKI](): the Portuguese Wikipedia (2022/06/20 dump).
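
As an illustration of this metadata-based filtering, here is a minimal sketch for the OSCAR case. It assumes the OSCAR 22.01 streaming schema, where each record exposes its source URL under `meta['warc_headers']['warc-target-uri']`, and it approximates PT-PT by keeping pages served from `.pt` domains; the actual filtering pipeline may have used different signals:

```py
from urllib.parse import urlparse

from datasets import load_dataset

# Stream the Portuguese split of OSCAR 22.01 (a gated dataset: its terms must be accepted on the Hub).
oscar_pt = load_dataset('oscar-corpus/OSCAR-2201', 'pt', split='train', streaming=True)

def is_pt_pt(record):
    # Hypothetical heuristic: keep documents hosted under the .pt ccTLD.
    url = record['meta']['warc_headers']['warc-target-uri']
    host = urlparse(url).hostname or ''
    return host.endswith('.pt')

pt_pt_pages = (record['text'] for record in oscar_pt if is_pt_pt(record))
```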

<br>

## Evaluation - CALAME-PT
GlórIA 1.3B's generative capabilities were evaluated on **CALAME-PT**, a new Portuguese benchmark whose task is to predict the last word of a sentence given the preceding context.

| Model and Size     | Exact-Match |
| ------------------ | ----------- |
| Gervasio-PTPT (1B) | 44.01       |
| mGPT (1.3B)        | 47.14       |
| GlórIA (1.3B)      | 52.79       |
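
For intuition, last-word exact match can be computed along the following lines. This is a minimal sketch, not the official CALAME-PT evaluation script, and the example pair at the end is a hypothetical stand-in for the benchmark data:

```py
import re

from transformers import pipeline

generator = pipeline('text-generation', model='NOVA-vision-language/GlorIA-1.3B-original')

def last_word_exact_match(examples):
    """examples: list of (context, gold_last_word) pairs."""
    hits = 0
    for context, gold in examples:
        out = generator(context, max_new_tokens=5, do_sample=False)[0]['generated_text']
        continuation = out[len(context):]        # the pipeline echoes the prompt; keep only the new text
        match = re.search(r'\w+', continuation)  # first generated word
        predicted = match.group(0) if match else ''
        hits += int(predicted.lower() == gold.lower())
    return 100.0 * hits / len(examples)

# Hypothetical example, for illustration only:
print(last_word_exact_match([('O gato subiu ao', 'telhado')]))
```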


<br>

# How to use
## Basic Inference Example
```py
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model='NOVA-vision-language/GlorIA-1.3B-original')
>>> generator("A culinária portuguesa é rica em aromas e", do_sample=True, min_length=50)
[{'generated_text': 'A culinária portuguesa é rica em aromas e'}]
```
## Recommended Parameters and Usage (for more flexibility)
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextGenerationPipeline

model = AutoModelForCausalLM.from_pretrained('NOVA-vision-language/GlorIA-1.3B-original')
tokenizer = AutoTokenizer.from_pretrained('NOVA-vision-language/GlorIA-1.3B-original')

generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id,
    no_repeat_ngram_size=0, num_beams=4, repetition_penalty=2.0, temperature=1.0,
    output_scores=True, early_stopping=True
)

generator = TextGenerationPipeline(model=model, task="text-generation",
                                   tokenizer=tokenizer, device=0)  # device=0 targets the first GPU; use device=-1 for CPU
completion_prompts = ["Fernando Pessoa foi um dos poetas mais relevantes de"]
out = generator(completion_prompts, generation_config=generation_config)
# [[{'generated_text': 'Fernando Pessoa foi um dos poetas mais relevantes de toda a literatura portuguesa, autor de uma obra que se estende por mais de quatro dezenas de livros, entre os quais "Mensagem", "O Guardador de Rebanhos", "Livro do desassossego", "Odes",'}]]
```
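
Note that combining `do_sample=True` with `num_beams=4` makes `transformers` use beam-search multinomial sampling, and the fairly strong `repetition_penalty=2.0` mirrors the widget settings in this card's header, helping curb repetition in longer completions.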

<br>


# Citation


Please use the following BibTeX to cite our paper:
```
@InProceedings{gloria_ptpt_propor2024,
  author="Lopes, Ricardo
          and Magalhães, João
          and Semedo, David",
  title="GlórIA: A Generative and Open Large Language Model for Portuguese",
  booktitle="Computational Processing of the Portuguese Language (PROPOR 2024)",
  year="2024",
}
```

**License**: GlórIA's usage is restricted to research-only purposes, subject to the ClueWeb22 Dataset license, which can be freely obtained [here](https://www.lemurproject.org/clueweb22/obtain.php).


# Acknowledgements

We would like to thank Arquivo.pt's team for their content preservation efforts, and for all the help and guidance in accessing the archived web pages at scale.
This work was partially funded by the FCT project NOVA LINCS Ref. UIDP/04516/2020, by CMU|Portugal project iFetch, Ref. CMUP LISBOA-01-0247-FEDER-045920, and by the FCT project Ref. Nº CPCA-IAC/AV/594875/2023.

<br>