metadata
tags:
  - text-generation
  - pytorch
inference: false
license: llama2
language:
  - pt
pipeline_tag: text-generation
library_name: transformers
datasets:
  - dominguesm/CC-MAIN-2023-23


Canarim-7B

Canarim-7B is a Portuguese large language model developed by Maicon Domingues.

Model description

The model was pretrained on 16 billion tokens from the Portuguese subset of CommonCrawl 2023-23, starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.

Key Features

  • Language: Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences.
  • Architecture: Inherits the LLaMA2-7B architecture, since training started from the LLaMA2-7B weights.
  • Diverse Dataset: The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese.

Applications

Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than to zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Here are some practical applications:

  • Natural Language Understanding (NLU): Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
  • Natural Language Generation (NLG): Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format.
  • Language Translation: Suitable for high-quality translation between Portuguese and other languages, especially when examples of desired translations are included during model training or fine-tuning.

Tips for Efficient Use

  • Few-shot Learning: When using Canarim-7B for specific tasks, it is beneficial to provide a few relevant examples. This helps the model better understand the context and purpose of the task.
  • Contextualization: Including additional context in the input can significantly improve the quality of the model’s predictions and text generation.
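
As a minimal sketch of the few-shot tip above, the helper below assembles a prompt from a handful of labeled examples followed by the new input. The function name `build_few_shot_prompt`, the `Texto:`/`Sentimento:` field labels, and the example sentences are assumptions made for illustration; they are not part of the model's training format or a layout the model requires.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Texto",
    output_label: str = "Sentimento",
) -> str:
    """Format (input, output) pairs plus a final unanswered query as one prompt."""
    blocks = [
        f"{input_label}: {text}\n{output_label}: {answer}"
        for text, answer in examples
    ]
    # The last block leaves the output empty so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    # A blank line separates the examples from each other.
    return "\n\n".join(blocks)


# Hypothetical sentiment examples, written only for this sketch.
examples = [
    ("Adorei o filme, recomendo a todos.", "positivo"),
    ("O atendimento foi péssimo.", "negativo"),
]
prompt = build_few_shot_prompt(examples, "A comida estava deliciosa.")
print(prompt)
```

The resulting string can be passed to the model as the prompt. Keeping the same labels and separators across all examples gives a base model a clear pattern to continue.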

Getting Started

To start using Canarim-7B with the Transformers library, first install it if you haven't already. The example below also needs PyTorch, and Accelerate for device_map="auto":

pip install transformers accelerate torch

You can then load the model using the Transformers library. Here's a simple example of how to use the model for text generation using the pipeline function:

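from transformers import AutoTokenizer, pipeline
import torch

model_id = "dominguesm/canarim-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in half precision and let Accelerate place it on the available devices.
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Illustrative placeholder prompt; a few-shot prompt usually works better.
prompt = "A capital do Brasil é"

sequences = pipe(
    prompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048,  # upper bound on prompt plus generated tokens
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)

# Each result is a dict whose "generated_text" holds the prompt followed by the continuation.
for seq in sequences:
    print(seq["generated_text"])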

This code snippet demonstrates how to generate text with Canarim-7B. You can customize the input text and adjust parameters like max_length according to your requirements.
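
Because Canarim-7B is a base model, it may keep writing past the answer you wanted, for instance by inventing further examples after a few-shot prompt. The sketch below shows one way to trim the output: it removes the echoed prompt and cuts the continuation at the first stop string. The function name `extract_completion`, the blank-line stop sequence, and the hand-written `generated_text` are assumptions for illustration, not output from the model.

```python
from typing import Sequence


def extract_completion(
    generated_text: str,
    prompt: str,
    stop_sequences: Sequence[str] = ("\n\n",),
) -> str:
    """Return only the newly generated text, cut at the earliest stop sequence."""
    # The text-generation pipeline echoes the prompt by default; drop it if present.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # Find the earliest position at which any stop sequence occurs.
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()


# Hand-written stand-in for a pipeline result, used only to demonstrate the helper.
prompt = "Texto: A comida estava deliciosa.\nSentimento:"
generated_text = (
    prompt + " positivo\n\nTexto: O hotel era barulhento.\nSentimento: negativo"
)
print(extract_completion(generated_text, prompt))
```

With the pipeline example above, the helper would be applied to sequences[0]["generated_text"] together with the prompt that was sent.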

Citation

If you want to cite Canarim-7B, you can use this:

@misc {maicon_domingues_2023,
    author       = { {Maicon Domingues} },
    title        = { canarim-7b (Revision 08fdd2b) },
    year         = 2023,
    url          = { https://huggingface.co/dominguesm/canarim-7b },
    doi          = { 10.57967/hf/1356 },
    publisher    = { Hugging Face }
}

License

Canarim-7B is released under the LLAMA 2 COMMUNITY LICENSE AGREEMENT.