---
language: pt
license: cc-by-nc-4.0
co2_eq_emissions: 710
---

# QUOKKA

## Model description

To the best of our knowledge, QUOKKA is the first generative model for Portuguese from Portugal (PT-PT).
Our model is a fine-tuned version of [Phoenix](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b), released on 04/08/2023.
The backbone of Phoenix is [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1), which was fine-tuned on a large dataset of 267k instruction samples and 189k conversation samples.

## Intended uses & limitations

You can use the model for text generation in Portuguese or fine-tune it on a downstream task.

### How to use

You can use this model directly with a pipeline for text generation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

checkpoint = "automaise/quokka-7b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# device=0 places the pipeline on the first GPU; use device=-1 to run on CPU.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

# Prompts follow the Phoenix chat format: the user turn comes after <human>
# and the model continues the text after <bot>.
generator("<human>Olá, consegues ajudar-me?<bot>")
```

### Limitations and bias

* **Language:** the model was fine-tuned on Portuguese data only and might not generalize appropriately to other languages.
* **Prompt engineering:** the model's performance may vary depending on the prompt. We recommend writing clear and specific instructions (see the illustrative prompts below).
* **Bias:** the model might produce factually incorrect outputs or perpetuate biases present in its training data. It is important to be aware of these limitations and to exercise caution when using the model for human-facing interactions. This bias will also carry over to all subsequent fine-tuned versions of this model.

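As a purely illustrative example of the point about prompt specificity, the snippet below contrasts a vague prompt with a more constrained one, reusing the `generator` pipeline defined above; the prompts themselves are hypothetical and not drawn from the training data.

```python
# Vague prompt: leaves the format, scope, and length up to the model.
# ("Tell me about Lisbon.")
vague = "<human>Fala-me sobre Lisboa.<bot>"

# More specific prompt: states the task, the format, and the length.
# ("Write a list of 3 historical facts about Lisbon, each in one short sentence.")
specific = (
    "<human>Escreve uma lista com 3 factos históricos sobre Lisboa, "
    "cada um numa frase curta.<bot>"
)

generator(specific, max_new_tokens=128)
```
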
## Training data

QUOKKA was fine-tuned on a dataset collected from different sources:

* Initially, we used the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which was built by translating 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages using the Google Translate API. For our purposes, we selected only the Portuguese subset and kept the samples originating from Dolly v2.

* We then incorporated the **[Cabrita](https://github.com/22-hours/cabrita)** dataset, which consists of a translation of Alpaca's training data into Portuguese generated with ChatGPT; these translations may therefore not be of the highest quality.

Additionally, we curated the data to remove elements such as:

* Samples with a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations (a sketch of this filter follows the list).
* Samples that lost their meaning during translation, particularly those instructing the model to translate a given text.

As a result, our final dataset comprises **56k samples**.

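The curation scripts themselves are not part of this card, so the following is only a minimal sketch of how the length-ratio filter could be applied to the Portuguese subset of Bactrian-X. The `pt` configuration name, the `instruction`/`input`/`output` column names, and the ratio threshold are assumptions made for illustration, not the settings actually used for QUOKKA.

```python
from datasets import load_dataset

# Assumed threshold: drop samples whose prompt is more than 4x longer than the answer.
MAX_PROMPT_TO_OUTPUT_RATIO = 4.0

# Portuguese subset of Bactrian-X (config name assumed to be the language code).
dataset = load_dataset("MBZUAI/Bactrian-X", "pt", split="train")

def keep_sample(example):
    # Column names follow the Alpaca-style schema and are assumptions.
    prompt = f"{example['instruction']} {example.get('input') or ''}".strip()
    output = example["output"] or ""
    if not output:
        return False
    # Long prompts with short answers were deemed likely to induce hallucinations.
    return len(prompt) / len(output) <= MAX_PROMPT_TO_OUTPUT_RATIO

filtered = dataset.filter(keep_sample)
print(f"Kept {len(filtered)} of {len(dataset)} samples")
```
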
## Training procedure

This model was trained on **1 x NVIDIA A100 40GB** for about 4-5 hours using QLoRA.
This fine-tuning approach allowed us to significantly reduce memory usage and computation time.

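The exact QLoRA hyperparameters are not reported in this card, so the snippet below is only a rough sketch of a 4-bit QLoRA setup on the Phoenix base model using `transformers`, `bitsandbytes`, and `peft`; the LoRA rank, alpha, dropout, and target modules are assumed values, not QUOKKA's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "FreedomIntelligence/phoenix-inst-chat-7b"

# Load the base model in 4-bit NF4 precision, as QLoRA prescribes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; rank, alpha, dropout, and target modules are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM-style fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
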
## Evaluation results

## Environmental impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact/#compute)
presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.

* **Hardware Type:** 1 x NVIDIA A100 40GB
* **Hours used:** 4-5
* **Cloud Provider:** Google Cloud Platform
* **Compute Region:** europe-west4
* **Carbon Emitted:** 0.71 kg CO2 eq.
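
As a rough sanity check, the calculator's estimate essentially multiplies the hardware's power draw, the runtime, and the regional grid's carbon intensity. The power draw and grid intensity below are assumed illustrative values, not the calculator's exact inputs for this setup.

```python
# emissions [kg CO2 eq.] ~= power draw [kW] * hours * grid carbon intensity [kg CO2 eq. / kWh]
gpu_power_kw = 0.25    # assumed average draw of one A100 40GB
hours = 4.5            # midpoint of the reported 4-5 hours
grid_intensity = 0.60  # assumed kg CO2 eq. per kWh for the compute region
print(round(gpu_power_kw * hours * grid_intensity, 2))  # ~0.68, the same order as the reported 0.71 kg
```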