---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024
tags:
- flan
- t5
- gqa
- instruct
datasets:
- pszemraj/flan-subsets-deduped
---

# tFINE-680m-e32-d16-gqa-flan
FLAN-tuned variant of a tFINE (T5) model with grouped-query attention (GQA).
- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
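A quick way to sanity-check these sizes is to read them off the standard T5 config fields. This is a minimal sketch, assuming the `transformers` fork from the Testing section below (or a recent `transformers`) is installed; if the repo ships custom code, `trust_remote_code=True` may also be needed.

```python
from transformers import AutoConfig

# Minimal sketch: confirm the sizes listed above via the standard T5 config fields.
cfg = AutoConfig.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
print(cfg.num_layers)          # encoder layers (expected 32)
print(cfg.num_decoder_layers)  # decoder layers (expected 16)
print(cfg.d_model)             # hidden size (expected 1024)
```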
## Testing

Install the `transformers` fork with GQA updates for T5 (⚠️ WIP 🚧):

```bash
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```

Then run inference as usual:
```python
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)

print(
    tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
)
```
## Quick eval

Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan`

`hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16)`, gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks         | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|---------------|---------|--------|--------|----------|---|--------|---|--------|
| boolq         | 2       | none   | 0      | acc      | ↑ | 0.7040 | ± | 0.0080 |
| openbookqa    | 1       | none   | 0      | acc      | ↑ | 0.1580 | ± | 0.0163 |
|               |         | none   | 0      | acc_norm | ↑ | 0.2420 | ± | 0.0192 |
| piqa          | 1       | none   | 0      | acc      | ↑ | 0.6132 | ± | 0.0114 |
|               |         | none   | 0      | acc_norm | ↑ | 0.6159 | ± | 0.0113 |
| social_iqa    | 0       | none   | 0      | acc      | ↑ | 0.4319 | ± | 0.0112 |
| tinyArc       | 0       | none   | 25     | acc_norm | ↑ | 0.2898 | ± | N/A    |
| tinyHellaswag | 0       | none   | 10     | acc_norm | ↑ | 0.3295 | ± | N/A    |
| tinyMMLU      | 0       | none   | 0      | acc_norm | ↑ | 0.2980 | ± | N/A    |
| winogrande    | 1       | none   | 0      | acc      | ↑ | 0.5020 | ± | 0.0141 |
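The numbers above come from EleutherAI's lm-evaluation-harness. A hedged sketch of how such a run could be reproduced with the v0.4.x Python API follows; the task list and settings are inferred from the table and the harness header above, not copied from the exact original command.

```python
# Hedged sketch: reproduce the eval above with lm-evaluation-harness (v0.4.x API).
# Task list and settings are inferred from the results table, not the original command.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,"
        "dtype=bfloat16,trust_remote_code=True"
    ),
    tasks=[
        "boolq", "openbookqa", "piqa", "social_iqa",
        "tinyArc", "tinyHellaswag", "tinyMMLU", "winogrande",
    ],
    batch_size=8,
)
print(results["results"])
```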
## Training and evaluation data

Fine-tuned on the `pszemraj/flan-subsets-deduped` dataset, using config `all`.
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
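For illustration only, here is a hedged sketch of how these values map onto `Seq2SeqTrainingArguments`. This is not the actual training script: `output_dir` is hypothetical, flags not listed above are omitted, and the `optim` string assumes a `transformers` build with bitsandbytes AdEMAMix support.

```python
# Hedged sketch: mapping the listed hyperparameters onto Seq2SeqTrainingArguments.
# Not the actual training script; paths and unlisted flags are placeholders.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./tFINE-680m-e32-d16-gqa-flan",  # hypothetical path
    learning_rate=8e-5,
    per_device_train_batch_size=4,   # 2 GPUs x 4 x grad accum 32 = 256 effective
    per_device_eval_batch_size=2,    # 2 GPUs x 2 = total eval batch size 4
    gradient_accumulation_steps=32,
    seed=17868,
    optim="paged_ademamix_32bit",    # assumes bitsandbytes AdEMAMix is available
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```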