---
tags:
- text-generation
- pytorch
inference: false
license: llama2
language:
- pt
pipeline_tag: text-generation
library_name: transformers
datasets:
- dominguesm/CC-MAIN-2023-23
---
<p align="center">
<img width="250" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
</p>
<hr>
# Canarim-7B
Canarim-7B is a Portuguese large language model developed by [Maicon Domingues](https://nlp.rocks).
## Model description
The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.
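If you want to inspect the pretraining corpus itself, the sketch below streams it with the `datasets` library. The `"train"` split name and the record schema are assumptions about the dataset layout rather than documented details:

```python
# A minimal sketch for peeking at the pretraining corpus.
# streaming=True avoids downloading the full corpus up front.
# The "train" split name and record fields are assumptions.
from datasets import load_dataset

dataset = load_dataset("dominguesm/CC-MAIN-2023-23", split="train", streaming=True)
print(next(iter(dataset)))  # inspect one raw record
```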
## Key Features
- **Language:** Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences.
- **Architecture:** Inherits the robust architecture from LLaMA2-7B, ensuring efficient performance and accurate results.
- **Diverse Dataset:** The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese.
## Applications
Canarim-7B was trained solely with a language-modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot than zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired output. Here are some practical applications:
- **Natural Language Understanding (NLU):** Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
- **Natural Language Generation (NLG):** Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format.
- **Language Translation:** Can translate between Portuguese and other languages, with noticeably better results when example translations are provided in the prompt (see the sketch below).
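For example, a few-shot translation prompt might look like the following sketch. The template and example pairs are purely illustrative; the model was not trained on any fixed prompt format:

```python
# A hypothetical few-shot prompt for English -> Portuguese translation.
# The layout and the example pairs are illustrative, not a trained template.
prompt = """Traduza do inglês para o português.

Inglês: The weather is beautiful today.
Português: O tempo está lindo hoje.

Inglês: I would like a cup of coffee.
Português: Eu gostaria de uma xícara de café.

Inglês: Where is the nearest train station?
Português:"""
```

This string can be passed as the `prompt` in the Getting Started example below.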
### Tips for Efficient Use
- **Few-shot Learning:** When using Canarim-7B for specific tasks, provide a few relevant examples in the prompt; this helps the model understand the context and intent of the task (a sentiment-analysis sketch follows this list).
- **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation.
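As referenced above, here is a minimal few-shot sketch for sentiment analysis; the labels and example sentences are made up for illustration:

```python
# A hypothetical few-shot prompt for Portuguese sentiment classification.
# The labels ("Positivo"/"Negativo") and sentences are illustrative only.
prompt = """Classifique o sentimento de cada frase como Positivo ou Negativo.

Frase: Adorei o atendimento, voltarei com certeza!
Sentimento: Positivo

Frase: O produto chegou quebrado e ninguém respondeu meus e-mails.
Sentimento: Negativo

Frase: A entrega foi rápida e o produto superou minhas expectativas.
Sentimento:"""
```

For classification-style completions like this, a low temperature and a small number of new tokens generally give more stable outputs.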
---
## Getting Started
To start using Canarim-7B with the Transformers library, first install the library if you haven't already:
```bash
pip install transformers
```
You can then load the model and generate text with the `pipeline` function:
```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "dominguesm/canarim-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Any Portuguese text works as a prompt; this one is illustrative.
prompt = "A capital do Brasil é"

sequences = pipe(
    prompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)

for seq in sequences:
    print(seq["generated_text"])
```
This snippet demonstrates basic text generation with Canarim-7B. You can customize the prompt and adjust generation parameters such as `max_length`, `temperature`, and `top_p` according to your requirements.
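If you prefer to work below the `pipeline` abstraction, the following sketch loads the model with `AutoModelForCausalLM` and calls `generate()` directly. The sampling parameters simply mirror the pipeline example above and are not an officially recommended configuration:

```python
# A sketch of direct generation without the pipeline API.
# Sampling parameters mirror the pipeline example above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dominguesm/canarim-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("O Brasil é um país", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=128,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that `device_map="auto"` requires the `accelerate` package to be installed.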
## License
Canarim-7B is released under the [LLAMA 2 Community License Agreement](https://ai.meta.com/llama/license/).