---
library_name: peft
base_model: TheBloke/zephyr-7B-beta-GPTQ
revision: gptq-8bit-32g-actorder_True
license: mit
language:
  - pt
tags:
  - gptq
  - ptbr
---

## Training procedure

The following GPTQ quantization config was used during training (a `GPTQConfig` sketch mirroring these values is shown after the list):

- quant_method: gptq
- bits: 8
- tokenizer: None
- dataset: None
- group_size: 32
- damp_percent: 0.1
- desc_act: True
- sym: True
- true_sequential: True
- use_cuda_fp16: False
- model_seqlen: 4096
- block_name_to_quantize: model.layers
- module_name_preceding_first_block: ['model.embed_tokens']
- batch_size: 1
- pad_token_id: None
- disable_exllama: True
- max_input_length: None
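
For reference, the values above map onto a `transformers.GPTQConfig` roughly as in the minimal sketch below. The `training_gptq_config` name is just for illustration and is not reused elsewhere in this card.

```python
from transformers import GPTQConfig

# Minimal sketch: mirrors the training-time quantization settings listed above.
training_gptq_config = GPTQConfig(
    bits=8,
    group_size=32,
    damp_percent=0.1,
    desc_act=True,
    sym=True,
    true_sequential=True,
    use_cuda_fp16=False,
    model_seqlen=4096,
    block_name_to_quantize="model.layers",
    module_name_preceding_first_block=["model.embed_tokens"],
    batch_size=1,
    disable_exllama=True,  # deprecated alias in newer transformers; use_exllama=False is the replacement
)
```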

### Framework versions

- PEFT 0.5.0

## Load model

```python
from transformers import AutoModelForCausalLM, GPTQConfig
from peft import PeftModel

# 8-bit GPTQ config matching the quantized base model revision.
gptq_config = GPTQConfig(
    bits=8,
    disable_exllama=True,
)

# Load the quantized base model...
_model = AutoModelForCausalLM.from_pretrained(
    'TheBloke/zephyr-7B-beta-GPTQ',
    quantization_config=gptq_config,
    device_map='auto',
    revision='gptq-8bit-32g-actorder_True',
)

# ...and attach the LoRA adapter from this repository on top of it.
model = PeftModel.from_pretrained(_model, 'matheusrdgsf/cesar-ptbr')
```
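
As an optional sanity check (a sketch, not required for inference), the PEFT wrapper exposes the adapter configuration it attached, and the model can be switched to inference mode before generating:

```python
# Adapter configs attached by PeftModel.from_pretrained, keyed by adapter name.
print(model.peft_config)

# Disable dropout in the adapter layers for inference.
model.eval()
```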

## Easy inference

```python
import time

from transformers import AutoTokenizer, GenerationConfig

# Tokenizer of the quantized base model (used to encode and decode).
tokenizer_model = AutoTokenizer.from_pretrained('TheBloke/zephyr-7B-beta-GPTQ')
# Tokenizer that provides the Zephyr chat template.
tokenizer_template = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-alpha')

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.1,
    top_p=0.25,
    top_k=0,
    max_new_tokens=512,
    repetition_penalty=1.1,
    eos_token_id=tokenizer_model.eos_token_id,
    pad_token_id=tokenizer_model.eos_token_id,
)


def get_inference(
    text,
    model,
    tokenizer_model=tokenizer_model,
    tokenizer_template=tokenizer_template,
    generation_config=generation_config,
):
    st_time = time.time()
    inputs = tokenizer_model(
        tokenizer_template.apply_chat_template(
            [
                {
                    "role": "system",
                    # "You are a chatbot for movie recommendations. Politely answer users with movie suggestions."
                    "content": "Você é um chatbot para indicação de filmes. Responda de maneira educada sugestões de filmes para os usuários.",
                },
                {"role": "user", "content": text},
            ],
            tokenize=False,
        ),
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(**inputs, generation_config=generation_config)

    print('inference time:', time.time() - st_time)
    # Keep only the final line of the decoded conversation (the assistant's reply).
    return tokenizer_model.decode(outputs[0], skip_special_tokens=True).split('\n')[-1]


# Example: "Could you recommend action movies up to 2 hours long?"
get_inference('Poderia indicar filmes de ação de até 2 horas?', model)
```