 ---
+language: pt
 license: cc-by-nc-4.0
+co2_eq_emissions: 710
 ---
+# QUOKKA
+## Model description
+To the best of our knowledge, QUOKKA is the first generative model for European Portuguese (PT-PT).
+Our model is a fine-tuned version of [Phoenix](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b) that was released on 04/08/2023.
+Phoenix uses [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1) as its backbone and was fine-tuned on 267k instruction samples and 189k conversation samples.
+## Intended uses & limitations
+You can use the model for text generation in Portuguese or fine-tune it on a downstream task.
+### How to use
+You can use this model directly with a pipeline for text generation:
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+
+checkpoint = "automaise/quokka-7b"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+# device=0 places the pipeline on the first CUDA GPU
+generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
+
+# The prompt wraps the user message between the <human> and <bot> markers
+generator("<human>Olá, consegues ajudar-me?<bot>")
+```
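The example prompt wraps the user message between `<human>` and `<bot>` markers. A minimal sketch of a helper that builds such single-turn prompts is shown here; `build_prompt` is a hypothetical name, and the format is inferred only from the example prompt, not from a documented template.

```python
def build_prompt(message: str) -> str:
    """Wrap a user message in the <human>/<bot> markers.

    Assumption: single-turn format inferred from the example prompt
    in this card; multi-turn formatting is not covered here.
    """
    # Strip surrounding whitespace so the markers sit right next to the text
    return f"<human>{message.strip()}<bot>"


prompt = build_prompt("Olá, consegues ajudar-me?")
print(prompt)
```

The resulting string can be passed to a text-generation pipeline in place of a hand-written prompt.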
+### Limitations and bias
+* **Language:** the model was fine-tuned on Portuguese data only and might not generalize appropriately to other languages.
+* **Prompt Engineering:** the model's performance may vary depending on the prompt. We recommend writing clear
+and specific instructions.
+* **Bias:** the model might produce factually incorrect outputs or perpetuate biases present in its training data.
+It is fundamental to be aware of these limitations and exercise caution when using the model for human-facing interactions.
+These biases will also affect all subsequent fine-tuned versions of this model.
+## Training data
+QUOKKA was fine-tuned on a dataset collected from different sources:
+* Initially, we used the **[Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)** dataset, which involves the
+translation of 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages using the Google Translate API.
+For our purposes, we selected only the Portuguese subset and focused on the samples pertaining to Dolly v2.
+* Then, we incorporated the **[Cabrita](https://github.com/22-hours/cabrita)** dataset, which consists of a translation of Alpaca's training data.
+The Portuguese translation was generated using ChatGPT, so these translations may not be of the highest quality.
+Additionally, we conducted data curation to remove elements such as:
+* Samples exhibiting a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations.
+* Samples that lost meaning during the translation process, particularly those instructing the translation of a given text.
+As a result, our final dataset comprises **56k samples**.
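The two curation rules above could be sketched roughly as follows. This is an illustration only, not the pipeline that was actually used: the field names (`instruction`, `output`), the ratio threshold, and the keyword list are all assumptions, since the card does not state them.

```python
from typing import Dict, List

# Hypothetical threshold: the card does not state the cutoff that was used.
MAX_PROMPT_TO_OUTPUT_RATIO = 5.0

# Hypothetical keywords for spotting "translate this text" instructions.
TRANSLATION_KEYWORDS = ("traduz", "translate")


def keep_sample(sample: Dict[str, str]) -> bool:
    """Return True if the sample passes both curation rules."""
    prompt = sample["instruction"]
    output = sample["output"]
    # Guard against empty outputs (also avoids division by zero below)
    if not output.strip():
        return False
    # Rule 1: drop samples whose prompt is much longer than the output
    if len(prompt) / len(output) > MAX_PROMPT_TO_OUTPUT_RATIO:
        return False
    # Rule 2: drop instructions that ask for a translation of a given text
    lowered = prompt.lower()
    return not any(keyword in lowered for keyword in TRANSLATION_KEYWORDS)


def curate(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the samples that pass the curation rules."""
    return [s for s in samples if keep_sample(s)]


samples = [
    {"instruction": "Explica o que é fotossíntese.",
     "output": "A fotossíntese é o processo pelo qual as plantas convertem luz em energia."},
    {"instruction": "Traduz o seguinte texto para inglês: Bom dia.",
     "output": "Good morning."},
    {"instruction": "Resume o seguinte artigo longo sobre a história de Portugal em poucas palavras.",
     "output": "Ok."},
]
curated = curate(samples)
print(len(curated))
```

A keyword match is a crude proxy for "lost meaning during translation"; in practice such samples would likely need manual review as well.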
+## Training procedure
+This model was trained on **1 x NVIDIA A100 40GB** for about 4-5 hours using QLoRA.
+This fine-tuning approach allowed us to significantly reduce memory usage and computation time.
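To give a sense of where the memory savings come from: QLoRA keeps the frozen base weights in 4-bit precision and trains only small adapter matrices. The back-of-the-envelope sketch below compares weight storage at 16-bit and 4-bit precision. The parameter count of 7 billion is an approximation taken from the model name, and the estimate ignores activations, optimizer state, adapter weights, and quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7 billion parameters, as suggested by the "7b" in the model name
NUM_PARAMS = 7e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # weights stored in 16-bit precision
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # weights stored in 4-bit precision

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions the base weights shrink by a factor of four, which is what leaves room for training on a single 40GB GPU.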
+## Evaluation results
+## Environmental impact
+Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact/#compute)
+presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+The estimate is based on the hardware, runtime, cloud provider, and compute region.
+* **Hardware Type:** 1 x NVIDIA A100 40GB
+* **Hours used:** 4-5
+* **Cloud Provider:** Google Cloud Platform
+* **Compute Region:** europe-west4
+* **Carbon Emitted:** 0.71 kg eq. CO2
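Estimates of this kind generally follow the pattern emissions = power × time × carbon intensity. The sketch below applies that pattern with a GPU power draw of 250 W, a grid carbon intensity of 0.57 kg CO2 eq. per kWh, and the upper bound of 5 hours. The power and intensity values are assumptions chosen for illustration; they are not stated in this card, and the calculator's exact inputs may differ.

```python
def estimate_emissions_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Estimate emissions as energy consumed (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# Assumed inputs (not stated in the card): 250 W power draw, 0.57 kg CO2 eq./kWh
emissions_kg = estimate_emissions_kg(power_watts=250.0, hours=5.0, kg_co2_per_kwh=0.57)
emissions_g = emissions_kg * 1000  # grams, the unit of the co2_eq_emissions metadata field

print(f"{emissions_kg:.2f} kg eq. CO2")
```

With these assumed inputs the result lands close to the reported 0.71 kg eq. CO2, which corresponds to the `co2_eq_emissions: 710` value in the metadata.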