	---
	language: pt
	license: cc-by-nc-4.0
	co2_eq_emissions: 710
	---

	# QUOKKA

	## Model description

	To the best of our knowledge, QUOKKA is the first generative model for European Portuguese (PT-PT).
	Our model is a fine-tuned version of [Phoenix](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b) that was released on 04/08/2023.
	Phoenix is itself built on [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1) and was fine-tuned on a dataset of 267k instruction samples and 189k conversation samples.

	## Intended uses & limitations

	You can use the model for text generation in Portuguese or fine-tune it on a downstream task.

	### How to use

	You can use this model directly with a pipeline for text generation:

	```python
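	from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

	checkpoint = "automaise/quokka-7b"

	# Load the tokenizer and model weights from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForCausalLM.from_pretrained(checkpoint)

	# device=0 places the pipeline on the first CUDA GPU.
	generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

	# The user message is wrapped in <human> ... <bot> markers.
	result = generator("<human>Olá, consegues ajudar-me?<bot>")
	print(result[0]["generated_text"])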
	```
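The example above wraps the user message in `<human>` and `<bot>` markers. A minimal sketch of two helpers for building such prompts and for separating the reply from the echoed prompt is shown below. The names `build_prompt` and `extract_reply` are hypothetical and not part of the model or of `transformers`, and the sketch assumes a single-turn exchange, which is the only format shown in this README.

```python
HUMAN_TAG = "<human>"
BOT_TAG = "<bot>"


def build_prompt(message: str) -> str:
    """Wrap a single user message in the <human> ... <bot> markers."""
    return f"{HUMAN_TAG}{message.strip()}{BOT_TAG}"


def extract_reply(generated_text: str) -> str:
    """Return the text that follows the first <bot> marker.

    Text-generation pipelines usually return the prompt followed by the
    continuation, so everything up to and including the marker is dropped.
    If the marker is absent, the input is returned unchanged (stripped).
    """
    if BOT_TAG not in generated_text:
        return generated_text.strip()
    _, _, reply = generated_text.partition(BOT_TAG)
    return reply.strip()


prompt = build_prompt("Olá, consegues ajudar-me?")
print(prompt)
```

With the pipeline from the previous block, `extract_reply(generator(prompt)[0]["generated_text"])` would then yield only the model's answer.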

	### Limitations and bias

	* Language: the model was fine-tuned on Portuguese data only and might not generalize appropriately to other languages.
	* Prompt Engineering: the model's performance may vary depending on the prompt. We recommend writing clear
	and specific instructions.
	* Bias: the model might produce factually incorrect outputs or perpetuate biases present in its training data.
	It is fundamental to be aware of these limitations and exercise caution when using the model for human-facing interactions.
	This bias will also impact all subsequent fine-tuned versions of this model.

	## Training data

	QUOKKA was fine-tuned on a dataset collected from different sources:

	* Initially, we used the [Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) dataset, which contains
	translations of 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages, produced with the Google Translate API.
	We kept only the Portuguese subset and, within it, only the samples derived from Dolly v2.

	* Then, we incorporated the [Cabrita](https://github.com/22-hours/cabrita) dataset that consists of a translation of Alpaca's training data.
	The Portuguese translation was generated using ChatGPT, so its quality may be uneven.

	Additionally, we conducted data curation to remove elements such as:

	* Samples exhibiting a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations.
	* Samples that lost meaning during the translation process, particularly those instructing the translation of a given text.

	As a result, our final dataset comprises 56k samples.
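The two curation rules described above could be sketched roughly as follows. This is an illustration only, not the filtering code actually used. It assumes Alpaca-style records with `instruction`, `input`, and `output` fields, measures length in characters, and uses a hypothetical ratio threshold and a hypothetical keyword list, none of which are stated in this README.

```python
from typing import Dict, List

# Hypothetical threshold: the README does not state the value that was used.
MAX_PROMPT_TO_OUTPUT_RATIO = 5.0

# Hypothetical keywords for spotting "translate this text" instructions.
TRANSLATION_KEYWORDS = ("traduz", "tradução", "translate")


def prompt_output_ratio(sample: Dict[str, str]) -> float:
    """Ratio of prompt length to output length, measured in characters."""
    prompt_len = len(sample["instruction"]) + len(sample.get("input", ""))
    output_len = len(sample["output"])
    if output_len == 0:
        return float("inf")  # an empty output is always filtered out
    return prompt_len / output_len


def is_translation_task(sample: Dict[str, str]) -> bool:
    """Heuristic check for instructions that ask to translate a text."""
    instruction = sample["instruction"].lower()
    return any(keyword in instruction for keyword in TRANSLATION_KEYWORDS)


def curate(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep samples that pass both the ratio rule and the translation rule."""
    return [
        s for s in samples
        if prompt_output_ratio(s) <= MAX_PROMPT_TO_OUTPUT_RATIO
        and not is_translation_task(s)
    ]


samples = [
    {"instruction": "Resume o texto seguinte numa frase.",
     "input": "a" * 400, "output": "Resumo."},
    {"instruction": "Traduz a frase seguinte para inglês.",
     "input": "Bom dia.", "output": "Good morning."},
    {"instruction": "Qual é a capital de Portugal?",
     "input": "", "output": "A capital de Portugal é Lisboa."},
]
print(len(curate(samples)))  # 1: only the last sample passes both rules
```

A keyword heuristic like this is crude; in practice, manual review or a more careful classifier may be needed to catch translation tasks phrased differently.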

	## Training procedure

	This model was trained on a single NVIDIA A100 40GB GPU for about 4-5 hours using QLoRA.
	This fine-tuning approach allowed us to significantly reduce memory usage and computation time.
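To give a rough sense of the memory savings, the back-of-the-envelope calculation below compares the memory needed to hold the base model weights in 16-bit precision with the 4-bit quantized storage that QLoRA uses for the frozen base model. The parameter count of 7 billion is an assumption based on the model name, and the estimate ignores quantization constants, LoRA adapter weights, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights need about a quarter of the memory, which leaves more of a single 40 GB GPU available for activations and adapter training.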

	## Evaluation results

	## Environmental impact

	Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact/#compute)
	presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
	The estimate is based on the hardware, runtime, cloud provider, and compute region.

	* Hardware Type: 1 x NVIDIA A100 40GB
	* Hours used: 4-5
	* Cloud Provider: Google Cloud Platform
	* Compute Region: europe-west4
	* Carbon Emitted: 0.71 kg CO2 eq.
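As a sanity check, an estimate of this kind can be approximated as power draw times runtime times the carbon intensity of the grid. The sketch below is not the calculator's actual implementation. The 250 W power draw and the 0.57 kg CO2 eq. per kWh carbon intensity are assumptions that do not appear in this README; they were chosen because, with the upper bound of 5 hours, they reproduce the reported figure.

```python
def estimate_emissions_kg(power_kw: float, hours: float,
                          intensity_kg_per_kwh: float) -> float:
    """Energy used (kWh) multiplied by grid carbon intensity (kg CO2 eq./kWh)."""
    return power_kw * hours * intensity_kg_per_kwh


GPU_POWER_KW = 0.25       # assumption: 250 W draw for one A100 40GB
CARBON_INTENSITY = 0.57   # assumption: kg CO2 eq. per kWh for the region

low = estimate_emissions_kg(GPU_POWER_KW, 4, CARBON_INTENSITY)
high = estimate_emissions_kg(GPU_POWER_KW, 5, CARBON_INTENSITY)

print(f"Estimated range: {low:.2f} to {high:.2f} kg CO2 eq.")
```

Under these assumptions the 5-hour estimate is about 0.71 kg CO2 eq., which is consistent with the figure reported above and with the `co2_eq_emissions: 710` metadata field if that value is read as grams.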