
	---
	language:
	- pt
	license: llama2
	library_name: transformers
	tags:
	- code
	- analytics
	- analise-dados
	- portugues-BR
	base_model: codellama/CodeLlama-7b-Instruct-hf
	datasets:
	- semantixai/Test-Dataset-Lloro
	---

	Lloro 7B

	<img src="https://cdn-uploads.huggingface.co/production/uploads/653176dc69fffcfe1543860a/h0kNd9OTEu1QdGNjHKXoq.png" width="300" alt="Lloro-7b Logo"/>


	Lloro, developed by Semantix Research Labs, is a language model trained to perform data analysis in Python from Portuguese prompts. It is a fine-tuned version of codellama/CodeLlama-7b-Instruct-hf, trained on synthetic datasets. Fine-tuning used the QLoRA methodology on a V100 GPU with 16 GB of memory.



	Model description


	Model type: A 7B-parameter model fine-tuned on synthetic datasets.

	Language(s) (NLP): Primarily Portuguese, but the model can also understand English.

	Finetuned from model: codellama/CodeLlama-7b-Instruct-hf



	What is Lloro's intended use(s)?


	Lloro is built for data analysis in Portuguese contexts.

	Input: Text

	Output: Text (Code)


	Usage

	Using Transformers
	```python
	# Import required libraries
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the model in half precision and place it on the available device(s)
	model_name = "semantixai/LloroV2"
	base_model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    return_dict=True,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Load the tokenizer
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

	# Define the prompt ([INST] / <<SYS>> instruct format, separated by real newlines)
	user_prompt = "Desenvolva um algoritmo em Python para calcular a média e a mediana dos preços de vendas por tipo de material do produto."
	system = "Provide answers in Python without explanations, only the code"
	prompt_template = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_prompt}[/INST]"

	# Tokenize the prompt and move it to the same device as the model
	input_ids = tokenizer([prompt_template], return_tensors="pt")["input_ids"].to(base_model.device)

	# Call the model
	outputs = base_model.generate(
	    input_ids,
	    do_sample=True,
	    top_p=0.95,
	    max_new_tokens=1024,
	    temperature=0.1,
	)

	# Decode only the newly generated tokens, dropping the echoed prompt
	generated = outputs[:, input_ids.shape[1]:]
	output_text = tokenizer.batch_decode(generated, skip_special_tokens=False)
	print(output_text[0])
	```
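The prompt in the example above uses the instruct layout that the CodeLlama-Instruct base model expects: the system message sits between `<<SYS>>` and `<</SYS>>`, and the whole turn is wrapped in `[INST] ... [/INST]`, with newline characters as separators. As a minimal sketch, the template can be wrapped in a small helper. `build_prompt` and `DEFAULT_SYSTEM` are hypothetical names used for illustration; they are not part of the Lloro repository or of Transformers.

```python
# Default system message, copied from the usage example above.
DEFAULT_SYSTEM = "Provide answers in Python without explanations, only the code"


def build_prompt(user_prompt: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a system message and a user request in the [INST] / <<SYS>> layout."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_prompt}[/INST]"


# Example: a short Portuguese request ("compute the mean of the 'preco' column").
prompt = build_prompt("Calcule a média da coluna preco.")
print(prompt)
```

The returned string can be passed to the tokenizer in place of `prompt_template`.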

	Using an OpenAI-compatible inference server (like [vLLM](https://docs.vllm.ai/en/latest/index.html))
	```python
	from openai import OpenAI

	client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
	user_prompt = "Desenvolva um algoritmo em Python para calcular a média e a mediana dos preços de vendas por tipo de material do produto."
	completion = client.chat.completions.create(
	    model="semantixai/LloroV2",
	    temperature=0.1,
	    frequency_penalty=0.1,
	    messages=[
	        {"role": "system", "content": "Provide answers in Python without explanations, only the code"},
	        {"role": "user", "content": user_prompt},
	    ],
	)
	print(completion.choices[0].message.content)
	```


	Params
	Training Parameters
	| Params | Training Data                    | Examples | Tokens    | LR   |
	|--------|----------------------------------|----------|-----------|------|
	| 7B     | Synthetic instruction/code pairs | 28,907   | 3,031,188 | 1e-5 |


	Model Sources

	Test Dataset Repository: https://huggingface.co/datasets/semantixai/Test-Dataset-Lloro

	Model Dates

	Lloro was trained between November 2023 and January 2024.



	Performance
	| Model         | LLM as Judge | CodeBLEU Score | ROUGE-L | CodeBert-Precision | CodeBert-Recall | CodeBert-F1 | CodeBert-F3 |
	|---------------|--------------|----------------|---------|--------------------|-----------------|-------------|-------------|
	| GPT 3.5       | 91.22%       | 0.2745         | 0.2189  | 0.7502             | 0.7146          | 0.7303      | 0.7175      |
	| Instruct-Base | 97.40%       | 0.2487         | 0.1146  | 0.6997             | 0.6473          | 0.6713      | 0.6518      |
	| Instruct-FT   | 97.76%       | 0.3264         | 0.3602  | 0.7942             | 0.8178          | 0.8042      | 0.8147      |
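The F1 and F3 columns are F-beta scores, which combine precision and recall into one number; with beta = 3, recall is weighted more heavily than precision. The sketch below applies the standard F-beta formula to the precision and recall reported in the fine-tuned model's row. How the table values were aggregated is not stated in the source: if they are per-example averages (an assumption), the result of this aggregate calculation will be close to the reported figures but not identical.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall copied from the fine-tuned model's row of the table.
precision, recall = 0.7942, 0.8178

print(f"F1 = {f_beta(precision, recall, beta=1.0):.4f}")
print(f"F3 = {f_beta(precision, recall, beta=3.0):.4f}")
```

The values come out at roughly 0.806 and 0.815, slightly above the 0.8042 and 0.8147 in the table.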


	Training Info
	The following hyperparameters were used during training:

	| Parameter                 | Value                   |
	|---------------------------|-------------------------|
	| learning_rate             | 1e-5                    |
	| weight_decay              | 0.0001                  |
	| train_batch_size          | 1                       |
	| eval_batch_size           | 1                       |
	| seed                      | 42                      |
	| optimizer                 | Adam - paged_adamw_32bit |
	| lr_scheduler_type         | cosine                  |
	| lr_scheduler_warmup_ratio | 0.03                    |
	| num_epochs                | 5.0                     |
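To make the `cosine` scheduler and the 0.03 warmup ratio concrete, the sketch below reimplements a linear-warmup plus cosine-decay schedule in plain Python. The step count is an assumption: it combines the 28,907 training examples reported under Training Parameters with the batch size of 1 and 5 epochs listed here, and assumes one optimizer step per batch (no gradient accumulation, a single GPU), which the source does not state. `lr_at` is a hypothetical helper, not the trainer's own code.

```python
import math

BASE_LR = 1e-5        # learning_rate
WARMUP_RATIO = 0.03   # lr_scheduler_warmup_ratio
EXAMPLES = 28907      # training examples reported for the dataset
EPOCHS = 5            # num_epochs
BATCH_SIZE = 1        # train_batch_size

# Assumption: one optimizer step per batch (no gradient accumulation, one GPU).
total_steps = math.ceil(EXAMPLES / BATCH_SIZE) * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate climbs from zero to 1e-5 over the first 3% of steps and then decays smoothly back to zero by the final step.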

	QLoRA hyperparameters
	The following parameters related to Quantized Low-Rank Adaptation (QLoRA) and quantization were used during training:

	| Parameter     | Value     |
	|---------------|-----------|
	| lora_r        | 16        |
	| lora_alpha    | 64        |
	| lora_dropout  | 0.1       |
	| storage_dtype | "nf4"     |
	| compute_dtype | "float16" |
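Two quantities follow directly from these settings. Standard LoRA scales its low-rank update by `lora_alpha / lora_r`, which is 64 / 16 = 4 here, and a rank-`r` adapter on a weight matrix of shape `d_out x d_in` adds `r * (d_in + d_out)` trainable parameters. The sketch below computes both for a single 4096 x 4096 projection. The 4096 hidden size is that of the 7B base architecture; which layers carried adapters is not stated in the source, so the per-matrix count is illustrative only, and the helper names are hypothetical.

```python
LORA_R = 16       # lora_r
LORA_ALPHA = 64   # lora_alpha


def lora_scaling(alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 4096  # assumption: hidden size of the 7B base architecture

added = lora_params(HIDDEN, HIDDEN, LORA_R)
print("scaling factor:", lora_scaling(LORA_ALPHA, LORA_R))
print("adapter parameters per 4096 x 4096 projection:", added)
print("fraction of the full matrix:", added / (HIDDEN * HIDDEN))
```

For one such projection the adapter trains 131,072 parameters, under 1% of the 16.8 million weights in the frozen matrix, which is why the fine-tuning fits on a single 16 GB GPU when combined with 4-bit storage.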


	Experiments
	| Model               | Epochs | Overfitting | Final Epochs | Training Hours | CO2 Emission (kg) |
	|---------------------|--------|-------------|--------------|----------------|-------------------|
	| Code Llama Instruct | 1      | No          | 1            | 8.1            | 1.337             |
	| Code Llama Instruct | 5      | Yes         | 3            | 45.6           | 9.12              |

	Framework versions

	| Library      | Version |
	|--------------|---------|
	| bitsandbytes | 0.40.2  |
	| Datasets     | 2.14.3  |
	| PyTorch      | 2.0.1   |
	| Tokenizers   | 0.14.1  |
	| Transformers | 4.34.0  |