patricia-rocha committed
Commit: 8ebae91
Parent(s): ede5760
Update README.md

README.md CHANGED
@@ -8,7 +8,7 @@ co2_eq_emissions: 710
 
 ## Model description
 
-QUOKKA is
+QUOKKA is our first generative pre-trained transformer (GPT) model for Portuguese from Portugal (PT-PT).
 Our model is a fine-tuned version of [Phoenix](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b) that was released on 04/08/2023.
 The backbone of Phoenix is [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1), which was fine-tuned using a vast dataset consisting of 267k samples of instructions and 189k samples of conversations.
 
@@ -54,7 +54,7 @@ generator(f"<human>{prompt}<bot>")
 >> A série Rabo de Peixe foi filmada na ilha de São Miguel, nos Açores.
 ```
 
-####
+#### Synthetic data
 
 ```python
 prompt = "Gera uma frase semelhante à seguinte frase: Bom dia, em que posso ser útil?"
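The hunk headers above call a `generator` object with the Phoenix-style `<human>…<bot>` prompt wrapper, but this excerpt never shows how that object is built. Below is a minimal sketch, assuming the standard `transformers` text-generation pipeline; the checkpoint ID is a placeholder, not the actual QUOKKA repository name.

```python
# Sketch only: how the `generator` used in the README examples might be created.
# The model ID below is a placeholder; substitute the real QUOKKA checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="your-org/QUOKKA")

prompt = "Gera uma frase semelhante à seguinte frase: Bom dia, em que posso ser útil?"
# Prompts are wrapped in <human>/<bot> tags, matching the hunk headers above.
result = generator(f"<human>{prompt}<bot>", max_new_tokens=128)
print(result[0]["generated_text"])
```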
@@ -87,13 +87,13 @@ generator(f"<human>{prompt}<bot>")
 
 QUOKKA was fine-tuned on a dataset collected from different sources:
 
-* Initially, we used the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which involves the
+* Initially, we used the **[Cabrita](https://github.com/22-hours/cabrita)** dataset that consists of a translation of Alpaca's training data.
+The Portuguese translation was generated using ChatGPT. Therefore, it is important to note that these translations may not be of the highest quality.
+
+* Then, we incorporated the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which involves the
 translation of 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages using Google Translate API.
 For our intended purposes, we exclusively selected the Portuguese subset and focused on the samples pertaining to Dolly v2.
 
-* Then, we incorporated the **[Cabrita](https://github.com/22-hours/cabrita)** dataset that consists of a translation of Alpaca's training data.
-The Portuguese translation was generated using ChatGPT. Therefore, it is important to note that these translations may not be of the highest quality.
-
 Additionally, we conducted data curation to remove elements such as:
 
 * Samples exhibiting a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations.
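To make the curation heuristic in the last bullet concrete, here is an illustrative sketch; the field names and the ratio threshold are assumptions for the example, not values taken from the actual QUOKKA curation code.

```python
# Illustrative sketch of the prompt-length / output-length filter described above.
# Field names ("instruction", "output") and the threshold are assumed, not documented.
def keep_sample(sample: dict, max_ratio: float = 3.0) -> bool:
    """Keep a sample only if its prompt is not much longer than its output."""
    prompt_len = len(sample["instruction"].split())
    output_len = max(len(sample["output"].split()), 1)  # avoid division by zero
    return prompt_len / output_len <= max_ratio

samples = [
    {"instruction": "Resume o seguinte texto em poucas palavras: ...", "output": "Sim."},
    {"instruction": "Diz bom dia.", "output": "Bom dia, em que posso ser útil?"},
]
curated = [s for s in samples if keep_sample(s)]  # drops the first sample, keeps the second
```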