metadata

license: apache-2.0
datasets:
  - cnmoro/Instruct-PTBR-ENUS-11M
  - graelo/wikipedia
  - uonlp/CulturaX
  - pablo-moreira/gpt4all-j-prompt-generations-pt
  - eduagarcia/OSCAR-2301-pt_dedup
  - eduagarcia/cc100-pt
  - iara-project/news-articles-ptbr-dataset
  - MBZUAI/Bactrian-X
  - Gustrd/dolly-15k-libretranslate-pt
  - heloisy/cosmos_qa_ptbr
  - maritaca-ai/imdb_pt
  - squad_v1_pt
  - celsowm/conjur_artigos
  - celsowm/ambito_juridico_artigos
  - arubenruben/cnn_dailymail_azure_pt_pt
  - bigscience-data/roots_pt_wikiquote
  - bigscience-data/roots_pt_ted_talks_iwslt
language:
  - pt
metrics:
  - perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
  - text-generation-inference
widget:
  - text: Astronomia é uma ciência natural que estuda
    example_title: Exemplo
  - text: Em um achado chocante, o cientista descobriu um
    example_title: Exemplo
  - text: Python é uma linguagem de
    example_title: Exemplo
  - text: O Gato de Schrödinger é uma experiência mental
    example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.5
    temperature: 0.5
    top_k: 50
    top_p: 0.5
    max_new_tokens: 200
co2_eq_emissions:
  emissions: 15
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB

Teeny-tiny-llama-162m (Portuguese)

Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture (Tiny-llama implementation). It is designed for efficient natural language processing in Brazilian Portuguese on modest hardware.

Teeny-tiny-llama has been trained by leveraging scaling laws to determine the optimal number of tokens per parameter while incorporating preference pre-training.
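
As a rough illustration of the token-to-parameter ratio, the sketch below derives the training token count from values reported in this card (train_num_samples, block_size, and the 162 million parameter count). It assumes every training sample is a full block of block_size tokens. The comparison point of roughly 20 tokens per parameter is the commonly cited Chinchilla rule of thumb, which is an assumption here: the card does not say which scaling law was used.

```python
# Values reported in this model card.
TRAIN_NUM_SAMPLES = 1_831_873
BLOCK_SIZE = 2048
NUM_PARAMETERS = 162_000_000  # "162 million parameters" (rounded)

# Assumption: each training sample holds exactly BLOCK_SIZE tokens.
total_tokens = TRAIN_NUM_SAMPLES * BLOCK_SIZE
tokens_per_parameter = total_tokens / NUM_PARAMETERS

# Assumption: ~20 tokens per parameter is only a rule of thumb for comparison.
RULE_OF_THUMB = 20

print(f"training tokens:      {total_tokens:,}")
print(f"tokens per parameter: {tokens_per_parameter:.1f}")
print(f"rule of thumb:        {RULE_OF_THUMB}")
```

Under these assumptions the model sees about 3.75 billion tokens in its single epoch, or roughly 23 tokens per parameter.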

Features

Compact Design: Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.
Optimized Scaling: The model has been pre-trained using scaling laws to identify the ideal token-to-parameter ratio.
Custom Portuguese Dataset: Teeny-tiny-llama has been trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better cater to Portuguese language nuances and be better suited for fine-tuning tasks like instruction-tuning.

Details

Size: 162 million parameters
Dataset: Portuguese-Corpus-v3
Language: Portuguese
Number of steps: 457969
Batch size: 4
Optimizer: torch.optim.AdamW (warmup_ratio = 0.01, learning_rate = 6e-4, epsilon = 1e-8)
GPU: 1 NVIDIA A100-SXM4-40GB
Training time: ~ 36 hours
Emissions: 15 kg CO2eq (Germany)
Total Energy Consumption: 42 kWh
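
The optimizer settings above, combined with the "cosine" scheduler listed in the training set-up, imply a learning-rate curve that can be sketched in plain Python. This is a minimal sketch, not the training code: it assumes a linear warmup followed by a half-cosine decay to zero (the shape used by the cosine schedule in transformers), and it assumes the warmup length is the warmup ratio times the total number of steps, rounded up. The function name lr_at is hypothetical.

```python
import math

# Values reported in this model card.
PEAK_LR = 6e-4
WARMUP_RATIO = 0.01
TOTAL_STEPS = 457_969

# Assumption: warmup length is the ratio of total steps, rounded up.
WARMUP_STEPS = math.ceil(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Half-cosine decay from the peak learning rate down to 0.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```

With a warmup ratio of 0.01, the peak learning rate is reached after about 4,580 steps and then decays for the rest of the epoch.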

This repository has the source code used to train this model.

Training Set-up

Section	Setting	Value
Model args.	vocab_size	32000
	hidden_size	768
	intermediate_size	3072
	max_position_embeddings	2048
	num_attention_heads	12
	num_hidden_layers	12
	num_key_value_heads	12
	torch_dtype	"float32" *
Data args.	dataset_name	"nicholasKluge/portuguese-corpus-v3"
	dataset_split	"train"
	train_num_samples	1831873
	val_num_samples	18000
	block_size	2048
Training args.	evaluation_strategy	"steps"
	eval_steps	100000
	per_device_train_batch_size	4
	per_device_eval_batch_size	4
	gradient_accumulation_steps	1
	learning_rate	0.0006
	adam_epsilon	0.00000001
	weight_decay	0.01
	lr_scheduler_type	"cosine"
	warmup_ratio	0.01
	num_train_epochs	1
	gradient_checkpointing	false
	seed	42
	wandb_log_steps	1
	mixed_precision	'no'
	checkpointing_steps	22000

* With TF32 enabled during training.
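
As a sanity check on the model arguments in the table above, the parameter count can be reproduced by hand. The sketch below assumes the standard Llama layout: no bias terms, a gated MLP with three projection matrices, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embeddings. None of these layout details are stated in the card, so treat the result as an estimate.

```python
# Model args. reported in this model card.
vocab_size = 32000
hidden_size = 768
intermediate_size = 3072
num_hidden_layers = 12
num_attention_heads = 12
num_key_value_heads = 12

head_dim = hidden_size // num_attention_heads
kv_dim = num_key_value_heads * head_dim  # equals hidden_size here

# Token embeddings and output head (assumption: weights are not tied).
embeddings = vocab_size * hidden_size
lm_head = vocab_size * hidden_size

# Per-layer weights (assumption: no bias terms anywhere).
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim  # q, o, k, v
mlp = 3 * hidden_size * intermediate_size  # gate, up, down projections
norms = 2 * hidden_size  # two RMSNorm weight vectors
per_layer = attention + mlp + norms

# Total includes one final RMSNorm weight vector.
total_parameters = embeddings + lm_head + num_hidden_layers * per_layer + hidden_size

print(f"parameters per layer: {per_layer:,}")
print(f"total parameters:     {total_parameters:,}")
```

Under these assumptions the total comes to about 162.4 million, in line with the 162 million reported for the model.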

Usage

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nicholasKluge/Teeny-tiny-llama-162m")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Teeny-tiny-llama-162m")
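
The snippets above only load the model. A minimal generation sketch follows, reusing the sampling parameters from the inference section of this card's metadata. The helper name generate_text is hypothetical, and do_sample=True is an assumption added so that temperature, top_k, and top_p take effect. The transformers import is placed inside the function so the settings can be inspected without loading the library.

```python
MODEL_ID = "nicholasKluge/Teeny-tiny-llama-162m"

# Sampling parameters from the "inference" section of the card's metadata.
GENERATION_KWARGS = {
    "repetition_penalty": 1.5,
    "temperature": 0.5,
    "top_k": 50,
    "top_p": 0.5,
    "max_new_tokens": 200,
    "do_sample": True,  # assumption: sampling is needed for temperature/top_k/top_p
}


def generate_text(prompt: str) -> str:
    """Generate a continuation for a Portuguese prompt (hypothetical helper)."""
    # Imported lazily; calling this function downloads the model weights.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=MODEL_ID)
    outputs = pipe(prompt, **GENERATION_KWARGS)
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    # One of the widget prompts from the card's metadata.
    print(generate_text("Astronomia é uma ciência natural que estuda"))
```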

Limitations

🤥 Generative AI models, like LLMs used for text generation/conversation or GANs for image generation, can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, given the model's tendency to output hallucinations. Such models can generate deceptive visuals, human-like textual content, music, or combined media that might seem genuine at first glance.

🤬 Machine learning systems can inherit social and historical stereotypes from the data used to train them. Given these biases, models can be prone to produce toxic content, that is, text, images, videos, or comments, that is harmful, offensive, or detrimental to individuals, groups, or communities. Also, models that automate decision-making can have biases against certain groups, affecting people based on sensitive attributes in an unjust manner.

Evaluations

Models	Average	ARC	Hellaswag	MMLU	TruthfulQA
Gpt2-portuguese-small	30.22	22.48 $\pm$ 0.01	29.62 $\pm$ 0.00	27.36 $\pm$ 0.00	41.44 $\pm$ 0.01

Evaluations were performed using the Language Model Evaluation Harness (by EleutherAI). Thanks to Laiviet for translating some of the tasks in the LM-Evaluation-Harness.

Steps	Evaluation Loss	Perplexity	Total Energy Consumption
100,000	3.19	24.52	3.75 kWh
200,000	3.02	20.58	7.51 kWh
300,000	2.83	16.98	11.25 kWh
400,000	2.79	16.41	30.20 kWh
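
Perplexity is the exponential of the cross-entropy loss, so the two columns above can be checked against each other. The sketch below recomputes perplexity from the rounded losses in the table. Small differences from the reported values are expected, presumably because the losses shown are rounded to two decimals (an assumption; the card does not say how the values were rounded).

```python
import math

# (steps, evaluation loss, reported perplexity) from the table above.
checkpoints = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


for steps, loss, reported in checkpoints:
    recomputed = perplexity(loss)
    print(f"{steps:>7} steps: exp({loss}) = {recomputed:.2f} (reported {reported})")
```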

Cite as 🤗


@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/Teeny-tiny-llama-162m},
  author = {Nicholas Kluge Corrêa},
  title = {Teeny-tiny-llama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}

License

Teeny-tiny-llama-162m is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.