patricia-rocha committed
Commit 8ebae91
1 Parent(s): ede5760

Update README.md

Files changed (1): README.md (+6 -6)
README.md CHANGED
@@ -8,7 +8,7 @@ co2_eq_emissions: 710
 
 ## Model description
 
- QUOKKA is the pioneering generative model for Portuguese from Portugal (PT-PT) to the best of our knowledge.
+ QUOKKA is our first generative pre-trained transformer (GPT) model for Portuguese from Portugal (PT-PT).
 Our model is a fine-tuned version of [Phoenix](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b) that was released on 04/08/2023.
 The backbone of Phoenix is [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1), which was fine-tuned using a vast dataset consisting of 267k samples of instructions and 189k samples of conversations.
 
@@ -54,7 +54,7 @@ generator(f"<human>{prompt}<bot>")
 >> A série Rabo de Peixe foi filmada na ilha de São Miguel, nos Açores.
 ```
 
- #### Generate synthetic data
+ #### Synthetic data
 
 ```python
 prompt = "Gera uma frase semelhante à seguinte frase: Bom dia, em que posso ser útil?"
@@ -87,13 +87,13 @@ generator(f"<human>{prompt}<bot>")
 
 QUOKKA was fine-tuned on a dataset collected from different sources:
 
- * Initially, we used the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which involves the
+ * Initially, we used the **[Cabrita](https://github.com/22-hours/cabrita)** dataset that consists of a translation of Alpaca's training data.
+ The Portuguese translation was generated using ChatGPT. Therefore, it is important to note that these translations may not be of the highest quality.
+
+ * Then, we incorporated the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which involves the
 translation of 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages using Google Translate API.
 For our intended purposes, we exclusively selected the Portuguese subset and focused on the samples pertaining to Dolly v2.
 
- * Then, we incorporated the **[Cabrita](https://github.com/22-hours/cabrita)** dataset that consists of a translation of Alpaca's training data.
- The Portuguese translation was generated using ChatGPT. Therefore, it is important to note that these translations may not be of the highest quality.
-
 Additionally, we conducted data curation to remove elements such as:
 
 * Samples exhibiting a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations.
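The last hunk keeps the note that curation removed samples with a high prompt-length-to-output-length ratio. A rough sketch of what such a filter could look like follows; the sample schema, the word-level length counting, and the `max_ratio` threshold are assumptions for illustration, not details recorded in this commit.

```python
def filter_by_length_ratio(samples, max_ratio=3.0):
    """Drop samples whose prompt is much longer than their output.

    Illustrative only: the dict keys, word-level lengths, and the
    max_ratio threshold are assumptions, not the values used for QUOKKA.
    """
    kept = []
    for sample in samples:
        prompt_len = len(sample["prompt"].split())
        output_len = len(sample["output"].split())
        # Empty outputs cannot be compared meaningfully, so drop them.
        if output_len == 0:
            continue
        if prompt_len / output_len <= max_ratio:
            kept.append(sample)
    return kept


# Example: the second sample pairs a long prompt with a one-word answer,
# so the filter removes it.
data = [
    {"prompt": "Qual é a capital de Portugal?", "output": "A capital de Portugal é Lisboa."},
    {"prompt": "Responde apenas com uma palavra: qual é a capital de Portugal?", "output": "Lisboa."},
]
print(len(filter_by_length_ratio(data)))  # 1
```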
 
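For context on the `generator(f"<human>{prompt}<bot>")` calls in the hunk headers above, here is a minimal sketch of how such a pipeline might be set up with the Hugging Face `transformers` library. The model identifier is a placeholder and the generation settings (`max_new_tokens`, `do_sample`) are illustrative; only the `<human>…<bot>` prompt template and the example prompt come from the README excerpt.

```python
from transformers import pipeline

# Placeholder checkpoint name -- replace with the actual QUOKKA model ID on the Hub.
MODEL_ID = "<quokka-model-id>"

# Text-generation pipeline, used as `generator` in the README excerpts.
generator = pipeline("text-generation", model=MODEL_ID)

# Prompt reused from the "Synthetic data" example in the diff above.
prompt = "Gera uma frase semelhante à seguinte frase: Bom dia, em que posso ser útil?"

# The instruction is wrapped in the <human>...<bot> template before generation.
result = generator(f"<human>{prompt}<bot>", max_new_tokens=64, do_sample=True)
print(result[0]["generated_text"])
```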