---
library_name: transformers
language:
  - pt
license: cc-by-4.0
tags:
  - text-generation
  - pytorch
  - LLM
  - Portuguese
  - mamba
datasets:
  - nicholasKluge/Pt-Corpus-Instruct
inference:
  parameters:
    repetition_penalty: 1.2
    temperature: 0.8
    top_k: 50
    top_p: 0.85
    max_new_tokens: 150
widget:
  - text: O Natal é uma
    example_title: Exemplo
  - text: A muitos anos atrás, em uma galáxia muito distante, vivia uma raça de
    example_title: Exemplo
  - text: Em meio a um escândalo, a frente parlamentar pediu ao Senador Silva para
    example_title: Exemplo
pipeline_tag: text-generation
---

# Mambarim-110M

*(Mambarim logo)*


## Model Summary

Mambarim-110M is the first Portuguese language model based on a state-space model architecture (Mamba), not a transformer.

WIP

## Details

- **Architecture:** a Mamba model pre-trained via causal language modeling
- **Size:** 119,930,880 parameters
- **Context length:** 2,048 tokens
- **Dataset:** Pt-Corpus Instruct (6.2B tokens)
- **Language:** Portuguese
- **Number of steps:** 758,423

This repository contains the source code used to train this model.
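
As a quick sanity check, the parameter count listed above can be verified directly from the released checkpoint. This is a minimal sketch using the standard `transformers` loading API, not part of the original training code:

```python
from transformers import MambaForCausalLM

# Load the released checkpoint and count trainable parameters;
# the total should match the 119,930,880 reported above.
model = MambaForCausalLM.from_pretrained("dominguesm/mambarim-110m")
print(sum(p.numel() for p in model.parameters()))
```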

## Intended Uses

WIP

## Out-of-scope Use

WIP

## Basic usage

You need to install `transformers` from `main` until `transformers==4.39.0` is released:

```bash
pip install git+https://github.com/huggingface/transformers@main
```

We also recommend installing both `causal-conv1d` and `mamba-ssm` with:

```bash
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm
```
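
These packages provide the fused CUDA kernels for Mamba; without them, `transformers` falls back to a slower pure-PyTorch implementation. A minimal sketch for checking which path your environment will use (the module names below are the import names of the two packages above):

```python
# If both kernel packages import cleanly, the fast CUDA path is used;
# otherwise transformers runs Mamba through its pure-PyTorch fallback.
try:
    import causal_conv1d  # noqa: F401
    import mamba_ssm  # noqa: F401
    print("Optimized Mamba kernels available.")
except ImportError:
    print("Kernels not found; expect the slower pure-PyTorch path.")
```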

You can use the classic `generate` API:

```python
>>> from transformers import MambaForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("dominguesm/mambarim-110m")
>>> model = MambaForCausalLM.from_pretrained("dominguesm/mambarim-110m")

>>> input_ids = tokenizer("O Natal é uma", return_tensors="pt")["input_ids"]
>>> out = model.generate(
...     input_ids,
...     repetition_penalty=1.2,
...     temperature=0.8,
...     top_k=50,
...     top_p=0.85,
...     do_sample=True,
...     max_new_tokens=10,
... )
>>> print(tokenizer.batch_decode(out))
["<s> O Natal é uma data em que as pessoas passam horas de lazer e"]
```

## Benchmarks

Evaluations on Brazilian Portuguese benchmarks were performed using a Portuguese implementation of the EleutherAI LM Evaluation Harness (created by Eduardo Garcia).

Detailed results can be found here.
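
To reproduce a single task, the harness exposes a `simple_evaluate` entry point. The sketch below is illustrative only: the task name `assin2_rte` is an assumption about how the Portuguese fork registers the ASSIN2 RTE benchmark, and may differ in practice:

```python
import lm_eval

# Hypothetical reproduction sketch; the task identifier "assin2_rte" is
# assumed, not verified against the Portuguese harness fork.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=dominguesm/mambarim-110m",
    tasks=["assin2_rte"],
)
print(results["results"])
```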

| Model | Average | ENEM | BLUEX | OAB Exams | ASSIN2 RTE | ASSIN2 STS | FaQuAD-NLI | HateBR | PT Hate Speech | tweetSentBR | Architecture |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TeenyTinyLlama-460m | 28.86 | 20.15 | 25.73 | 27.02 | 53.61 | 13 | 46.41 | 33.59 | 22.99 | 17.28 | LlamaForCausalLM |
| TeenyTinyLlama-160m | 28.2 | 19.24 | 23.09 | 22.37 | 53.97 | 0.24 | 43.97 | 36.92 | 42.63 | 11.39 | LlamaForCausalLM |
| MulaBR/Mula-4x160-v0.1 | 26.24 | 21.34 | 25.17 | 25.06 | 33.57 | 11.35 | 43.97 | 41.5 | 22.99 | 11.24 | MixtralForCausalLM |
| TeenyTinyLlama-460m-Chat | 25.49 | 20.29 | 25.45 | 26.74 | 43.77 | 4.52 | 34 | 33.49 | 22.99 | 18.13 | LlamaForCausalLM |
| mambarim-110m | 14.16 | 18.4 | 10.57 | 21.87 | 16.09 | 1.89 | 9.29 | 15.75 | 17.77 | 15.79 | MambaForCausalLM |
| GloriaTA-3B | 4.09 | 1.89 | 3.2 | 5.19 | 0 | 2.32 | 0.26 | 0.28 | 23.52 | 0.19 | GPTNeoForCausalLM |