Tucano - a TucanoBR Collection

updated Nov 13

Tucano is a series of decoder-transformers based on the Llama 2 architecture, natively pre-trained in Portuguese.

Tucano: Advancing Neural Text Generation for Portuguese

Paper • 2411.07854 • Published Nov 12 • 4
TucanoBR/Tucano-2b4

Text Generation • Updated Nov 17 • 284 • 3

Note 2.4 billion-parameter version of the Tucano series.
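Since the entry is listed as a Text Generation model, one plausible way to try it is through the `transformers` text-generation pipeline. The following is a minimal sketch, not the authors' documented usage: the helper name `generate`, the `max_new_tokens` default, and the example prompt are illustrative assumptions, and calling the function downloads the model weights.

```python
DEFAULT_MODEL_ID = "TucanoBR/Tucano-2b4"  # repository name taken from the listing


def generate(prompt: str, model_id: str = DEFAULT_MODEL_ID, max_new_tokens: int = 50) -> str:
    """Generate a continuation of `prompt` (hypothetical helper, for illustration only)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    # Calling this downloads the model weights from the Hugging Face Hub.
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns a list of dicts, each with a "generated_text" key.
    return outputs[0]["generated_text"]


# Example call (commented out because it triggers a large download):
# print(generate("A capital do Brasil é"))
```

The same helper should work for the smaller checkpoints in this collection by passing a different `model_id`, for example `TucanoBR/Tucano-160m`.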
TucanoBR/Tucano-2b4-Instruct

Text Generation • Updated Nov 17 • 510 • 2

Note 2.4 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-1b1

Text Generation • Updated Nov 17 • 493

Note 1.1 billion-parameter version of the Tucano series.
TucanoBR/Tucano-1b1-Instruct

Text Generation • Updated Nov 17 • 387 • 1

Note 1.1 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-630m

Text Generation • Updated Nov 17 • 120 • 1

Note 630 million-parameter version of the Tucano series.
TucanoBR/Tucano-160m

Text Generation • Updated Nov 17 • 272 • 1

Note 160 million-parameter version of the Tucano series.
TucanoBR/BERTimbau-large-text-filter

Text Classification • Updated Nov 13 • 15

Note BERTimbau-large fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/BERTimbau-base-text-filter

Text Classification • Updated Nov 13 • 30

Note BERTimbau-base fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/XGBClassifier-text-filter

Updated Nov 13

Note XGBClassifier trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
TucanoBR/XGBRegressor-text-filter

Updated Nov 13

Note XGBRegressor trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
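Both XGBoost filters take LaBSE embeddings rather than raw text, so scoring is a two-step process: embed, then predict. The sketch below shows that flow under stated assumptions: `score_texts` is a hypothetical helper, the file name of the saved XGBoost model inside the repository is not given here and appears only as a placeholder, and the meaning of the predicted values is not specified in this listing.

```python
from typing import Any, List, Sequence


def score_texts(texts: Sequence[str], encoder: Any, model: Any) -> List[Any]:
    """Embed `texts` with `encoder`, then score the embeddings with `model`.

    `encoder` needs an `encode(list_of_str)` method and `model` a `predict(X)`
    method; both are duck-typed so the helper can be exercised without downloads.
    """
    embeddings = encoder.encode(list(texts))  # one embedding vector per text
    return list(model.predict(embeddings))    # one prediction per text


# With the real components (assumed setup, not verified against the repository):
#   from sentence_transformers import SentenceTransformer
#   import xgboost as xgb
#   encoder = SentenceTransformer("sentence-transformers/LaBSE")
#   model = xgb.XGBClassifier()
#   model.load_model("path/to/downloaded-model-file")  # placeholder path
#   predictions = score_texts(["Um texto em português."], encoder, model)
```

Keeping the encoder and the model as parameters means the same helper could serve the classifier and the regressor, since both expose `predict`.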
TucanoBR/GigaVerbo

Viewer • Updated Nov 13 • 145M • 1.95k • 11

Note GigaVerbo is a 780 GB Portuguese text dataset containing over 200 billion tokens, built by concatenating several datasets available on Hugging Face.
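The two figures in this note imply a rough storage cost per token. The arithmetic below is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and treats "over 200 billion" as exactly 200 billion, so the result is an upper bound on the average.

```python
# Figures from the note; the decimal definition of GB is an assumption.
dataset_bytes = 780 * 10**9   # 780 GB of text
token_count = 200 * 10**9     # "over 200 billion" tokens, taken as a lower bound

# Average bytes of text per token (an upper bound, since the true count is higher).
bytes_per_token = dataset_bytes / token_count
print(f"at most about {bytes_per_token:.1f} bytes per token")
```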
TucanoBR/GigaVerbo-Text-Filter

Viewer • Updated Nov 13 • 110k • 87

Note GigaVerbo Text-Filter is a dataset with 110,000 randomly selected samples from 9 subsets of GigaVerbo, all scored by GPT-4o.
TucanoBR/Tucano-SFT

Viewer • Updated Nov 13 • 680k • 98

Note This is the dataset used to train the "Instruct" versions of the Tucano series.
TucanoBR/lambada-pt

Viewer • Updated Nov 7 • 5.15k • 49 • 2

Note This dataset is a Portuguese translation of the LAMBADA test split, as pre-processed by OpenAI.
TucanoBR/alpaca-eval-pt

Viewer • Updated Nov 11 • 805 • 50

Note This dataset contains 805 samples from the Alpaca dataset, translated into Portuguese.
nicholasKluge/reward-aira-dataset

Viewer • Updated Jun 18 • 70k • 103 • 3

Note This dataset contains pairs of completions to prompts. Used for DPO fine-tuning.
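To show why paired completions are what DPO needs, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the Tucano training code: the function name, argument names, and `beta` default are illustrative assumptions, and the inputs are the summed log-probabilities that the trained policy and a frozen reference model assign to the preferred and dispreferred completions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen completion than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # Numerically stable form of -log(sigmoid(x)) = log(1 + exp(-x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# The loss falls as the policy favors the chosen completion more strongly.
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2, which is the starting point of training if the policy is initialized from the reference.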