dominguesm committed on
Commit
164707c
1 Parent(s): c6a23b9

Added README

Files changed (2)
  1. README.md +90 -0
  2. assets/canarim.png +0 -0
README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ tags:
+ - text-generation
+ - pytorch
+ inference: false
+ license: cc-by-4.0
+ language:
+ - pt
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+
+ <p align="center">
+ <img width="250" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
+ </p>
+
+ <hr>
+
+ # `Canarim-7B`
+
+ Canarim-7B is a Portuguese language model developed by [Maicon Domingues](https://nlp.rocks).
+
+ ## Model description
+
+ The model was initialized from the weights of LLaMA2-7B and further pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23). The pretraining data has a cutoff of mid-2023.
+
+ ## Key Features
+
+ - **Language:** Specialized in understanding and generating Portuguese text, making it suitable for applications targeting Portuguese-speaking audiences.
+ - **Architecture:** Inherits the architecture of LLaMA2-7B, the model whose weights it was initialized from.
+ - **Diverse Dataset:** The pretraining data covers a wide range of topics and writing styles, which helps the model handle varied contexts and nuances in Portuguese.
+
+ ## Applications
+
+ Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than to zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Some practical applications:
+
+ - **Natural Language Understanding (NLU):** Handles tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
+ - **Natural Language Generation (NLG):** Generates coherent, contextually relevant text for content creation, chatbots, and similar uses, with better results when the prompt includes examples of the desired style or format.
+ - **Language Translation:** Can translate between Portuguese and other languages, especially when examples of the desired translations are provided in the prompt or during fine-tuning.
+
+ ### Tips for Efficient Use
+
+ - **Few-shot Learning:** When using Canarim-7B for a specific task, include a few relevant examples in the prompt. This helps the model infer the context and purpose of the task.
+ - **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation.
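
The few-shot tip above can be sketched as a small prompt builder. This is an illustrative sketch, not part of the committed README: the function name `build_few_shot_prompt`, the `Texto:`/`Sentimento:` field labels, and the example sentences are all assumptions chosen for a sentiment task.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format labelled examples followed by an unanswered query.

    `examples` is a list of (text, label) pairs. The field labels below are
    hypothetical; use whatever format fits your task.
    """
    blocks = []
    if instruction:
        blocks.append(instruction.strip())
    for text, label in examples:
        blocks.append(f"Texto: {text}\nSentimento: {label}")
    # The final block leaves the label empty so the model completes it.
    blocks.append(f"Texto: {query}\nSentimento:")
    return "\n\n".join(blocks)


examples = [
    ("Adorei o atendimento, voltarei com certeza.", "positivo"),
    ("O produto chegou quebrado e ninguém respondeu.", "negativo"),
]
prompt = build_few_shot_prompt(examples, "A entrega foi rápida e o preço é justo.")
print(prompt)
```

The resulting string can be passed as the prompt to a text-generation call. Because the examples all follow the same pattern, a base model is more likely to continue the last block with a label.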
+
+ ---
+
+ ## Getting Started
+
+ To start using Canarim-7B with the Transformers library, first install it, together with PyTorch and Accelerate (required for `device_map="auto"`), if you haven't already:
+
+ ```bash
+ pip install transformers torch accelerate
+ ```
+
+ You can then load the model with Transformers. Here is a simple text-generation example using the `pipeline` function:
+
+ ```python
+ from transformers import AutoTokenizer, pipeline
+ import torch
+
+ model_id = "dominguesm/canarim-7b"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Load the model in half precision and let Accelerate place it on the available devices.
+ pipe = pipeline(
+     "text-generation",
+     model=model_id,
+     tokenizer=tokenizer,
+     torch_dtype=torch.float16,
+     device_map="auto",
+ )
+
+ # Illustrative few-shot prompt (not from the original README); replace it with your own text.
+ prompt = (
+     "Pergunta: Qual é a capital da França?\nResposta: Paris\n\n"
+     "Pergunta: Qual é a capital do Brasil?\nResposta:"
+ )
+
+ sequences = pipe(
+     prompt,
+     do_sample=True,
+     num_return_sequences=1,
+     eos_token_id=tokenizer.eos_token_id,
+     max_length=2048,  # upper bound on prompt plus generated tokens
+     temperature=0.9,
+     top_p=0.6,
+     repetition_penalty=1.15,
+ )
+
+ # Each result is a dict whose "generated_text" holds the prompt followed by the continuation.
+ print(sequences[0]["generated_text"])
+ ```
+
+ This snippet shows how to generate text with Canarim-7B. You can change the prompt and adjust parameters such as `max_length`, which caps the combined length of the prompt and the generated tokens, to suit your requirements.
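
To build intuition for the `temperature` and `top_p` values in the snippet, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration under stated assumptions, not the Transformers implementation: the function name `top_p_filter` and the toy logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} for the tokens kept by top-p sampling."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a four-token vocabulary
print(sorted(top_p_filter(logits, temperature=0.9, top_p=0.6)))
print(sorted(top_p_filter(logits, temperature=0.9, top_p=0.95)))
```

A lower `top_p` restricts sampling to fewer high-probability tokens, which makes output more conservative. A higher value admits more of the tail and increases variety.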
+
+ ## License
+
+ Canarim-7B is released under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). This license allows others to copy, distribute, remix, adapt, and build upon the work, even commercially, as long as they credit the original creation.
assets/canarim.png ADDED