---
tags:
- text-generation
- pytorch
inference: false
license: llama2
language:
- pt
pipeline_tag: text-generation
library_name: transformers
datasets:
- dominguesm/CC-MAIN-2023-23
---

# Canarim-7B

Canarim-7B is a Portuguese large language model developed by [Maicon Domingues](https://nlp.rocks).

## Model description

The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.

## Key Features

- **Language:** Specialized in understanding and generating Portuguese text, making it well suited to applications targeting Portuguese-speaking audiences.
- **Architecture:** Inherits the LLaMA2-7B architecture unchanged, so it works with existing LLaMA2 tooling.
- **Diverse Dataset:** The pretraining dataset covers a wide range of topics and writing styles, which improves the model's ability to handle varied contexts and nuances in Portuguese.

## Applications

Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than to zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Some practical applications:

- **Natural Language Understanding (NLU):** Tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
- **Natural Language Generation (NLG):** Generating coherent and contextually relevant text for content creation, chatbots, and similar uses, with better results when the prompt includes examples of the desired style or format.
- **Language Translation:** Translation between Portuguese and other languages, especially when examples of the desired translations are included during model training or fine-tuning.

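
The few-shot usage described above can be sketched as a small prompt builder. This is only an illustration: the helper name `build_few_shot_prompt`, the `Texto:`/`Sentimento:` layout, and the sample sentences are assumptions made for this example, not a format prescribed by the model.

```python
def build_few_shot_prompt(examples, query):
    """Format labelled examples plus one unlabelled query as a single prompt.

    `examples` is a list of (text, label) pairs. The layout is hypothetical;
    any consistent pattern the model can continue should work similarly.
    """
    blocks = [f"Texto: {text}\nSentimento: {label}" for text, label in examples]
    # The last block leaves the label open so the model completes it.
    blocks.append(f"Texto: {query}\nSentimento:")
    return "\n\n".join(blocks)


# Illustrative sentiment-analysis examples in Portuguese.
examples = [
    ("Adorei o filme, recomendo a todos.", "positivo"),
    ("O atendimento foi péssimo.", "negativo"),
]
prompt = build_few_shot_prompt(examples, "A comida estava deliciosa.")
print(prompt)
```

The resulting string can be passed to the model as the prompt; the text generated after the final `Sentimento:` is then read as the predicted label.
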
### Tips for Efficient Use

- **Few-shot Learning:** When using Canarim-7B for a specific task, provide a few relevant examples in the prompt. This helps the model infer the context and purpose of the task.
- **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and generated text.

---

## Getting Started

To use Canarim-7B with the Transformers library, first install it if you haven't already, together with PyTorch and Accelerate (the latter is required for `device_map="auto"` in the example below):

```bash
pip install transformers torch accelerate
```

You can then load the model with the Transformers library. Here's a simple example of text generation using the `pipeline` function:

```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "dominguesm/canarim-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in half precision and let Accelerate place it on the available devices.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Illustrative prompt; replace it with your own text, ideally including a few examples.
prompt = "A capital do Brasil é"

sequences = pipe(
    prompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)

# Each result is a dict whose "generated_text" holds the prompt plus its continuation.
for seq in sequences:
    print(seq["generated_text"])
```

This snippet demonstrates how to generate text with Canarim-7B. You can change the input text and adjust parameters such as `max_length` to suit your requirements.

## Citation

If you want to cite **Canarim-7B**, you can use this:

```
@misc {maicon_domingues_2023,
    author       = { {Maicon Domingues} },
    title        = { canarim-7b (Revision 08fdd2b) },
    year         = 2023,
    url          = { https://huggingface.co/dominguesm/canarim-7b },
    doi          = { 10.57967/hf/1356 },
    publisher    = { Hugging Face }
}
```

## License

Canarim-7B is released under the [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://ai.meta.com/llama/license/).