---
tags:
- text-generation
- pytorch
inference: false
license: llama2
language:
- pt
pipeline_tag: text-generation
library_name: transformers
datasets:
- dominguesm/CC-MAIN-2023-23
---


<p align="center">
  <img width="250" alt="Camarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
</p>

<hr>

# Canarim-7B

Canarim-7B is a Portuguese large language model developed by [Maicon Domingues](https://nlp.rocks).

## Model description

The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.

## Key Features

-   **Language:** Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences.
-   **Architecture:** Inherits the robust architecture from LLaMA2-7B, ensuring efficient performance and accurate results.
-   **Diverse Dataset:** The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese.

## Applications

Canarim-7B was trained solely on a language-modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than to zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired output. Here are some practical applications:

-   **Natural Language Understanding (NLU):** Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
-   **Natural Language Generation (NLG):** Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format.
-   **Language Translation:** Suitable for translation between Portuguese and other languages, especially when example translations are included in the prompt, as sketched below.
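
As a concrete illustration of this few-shot pattern, here is a minimal prompt sketch for Portuguese-to-English translation; the example pairs are made up for illustration, and the resulting string can be passed to the `pipe` shown in the Getting Started section below:

```python
# Hypothetical few-shot translation prompt: two worked example pairs,
# followed by the sentence we want the model to translate.
prompt = (
    "Português: O gato dorme no sofá.\n"
    "Inglês: The cat sleeps on the couch.\n\n"
    "Português: Hoje o tempo está ótimo.\n"
    "Inglês: The weather is great today.\n\n"
    "Português: Eu gosto de ler livros aos domingos.\n"
    "Inglês:"
)
```

The model continues the text after the final `Inglês:`, so the completion is the translation.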

### Tips for Efficient Use

-   **Few-shot Learning:** When using Canarim-7B for a specific task, it is beneficial to provide a few relevant examples in the prompt; a minimal sketch follows this list. This helps the model better understand the context and purpose of the task.
-   **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation.
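
Combining both tips, here is a minimal sketch of a sentiment-classification prompt (the reviews and labels are hypothetical) that provides a line of task context plus a few labeled examples:

```python
# Hypothetical few-shot sentiment prompt: a one-line task description for
# context, two labeled examples, and a final line left for the model.
prompt = (
    "Classifique o sentimento de cada avaliação como positivo ou negativo.\n\n"
    "Avaliação: O produto chegou rápido e funciona perfeitamente.\n"
    "Sentimento: positivo\n\n"
    "Avaliação: Péssimo atendimento, não recomendo a loja.\n"
    "Sentimento: negativo\n\n"
    "Avaliação: A entrega atrasou, mas o produto é de ótima qualidade.\n"
    "Sentimento:"
)
```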

---

## Getting Started

To start using Canarim-7B with the Transformers library, first install it, together with `torch` and `accelerate` (required by `device_map="auto"` in the example below), if you haven't already:

```bash
pip install transformers torch accelerate
```

You can then load the model and generate text with the `pipeline` function. Here's a simple example:

```python
from transformers import AutoTokenizer, pipeline
import torch

model_id = "dominguesm/canarim-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Canarim-7B is a base model: it completes text, so phrase the prompt
# so that the desired output is its natural continuation.
prompt = "A inteligência artificial é"

sequences = pipe(
    prompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)

# Print each generated sequence (the prompt is included in the output).
for seq in sequences:
    print(seq["generated_text"])
```

This code snippet demonstrates how to generate text with Canarim-7B. You can customize the prompt and adjust parameters such as `max_length`, `temperature`, and `top_p` to suit your requirements.
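
If you need finer control over generation, you can also load the model with `AutoModelForCausalLM` and call `generate` directly; this is standard Transformers usage, with the prompt and parameter values below chosen for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dominguesm/canarim-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Tokenize an illustrative prompt and move it to the model's device.
inputs = tokenizer("O Brasil é um país", return_tensors="pt").to(model.device)

# max_new_tokens bounds only the generated continuation, unlike
# max_length, which also counts the prompt tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```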

## Citation

If you want to cite **Canarim-7B**, you can use this BibTeX entry:

```bibtex
@misc{maicon_domingues_2023,
  author    = {Maicon Domingues},
  title     = {canarim-7b (Revision 08fdd2b)},
  year      = 2023,
  url       = {https://huggingface.co/dominguesm/canarim-7b},
  doi       = {10.57967/hf/1356},
  publisher = {Hugging Face}
}
```

## License

Canarim-7B is released under the [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://ai.meta.com/llama/license/).