---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024
tags:
- flan
- t5
- gqa
- instruct
datasets:
- pszemraj/flan-subsets-deduped
---
# tFINE-680m-e32-d16-gqa-flan
FLAN-tuned variant of a tFINE (T5) model with grouped-query attention (GQA):
- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
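These dimensions can be read back from the checkpoint's config once the fork described below is installed; a minimal sketch, assuming the config keeps the standard `T5Config` attribute names:
```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
# expected: 32 encoder layers, 16 decoder layers, hidden size 1024
print(config.num_layers, config.num_decoder_layers, config.d_model)
```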
## Testing
Install the [transformers fork with GQA updates for t5](https://github.com/pszemraj/transformers/tree/t5-gqa) (⚠️ WIP 🚧):
```sh
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```
Then run:
```py
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
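# load the tokenizer and GQA-enabled model (requires the fork installed above)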
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
"BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)
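# tokenize a prompt and generate a short completion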
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(
tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0]
)
```
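The same tokenizer and model handle multiple prompts in one call; a minimal batched sketch (the prompts are illustrative):
```py
prompts = [
    "What is the capital of France?",
    "Summarize: The quick brown fox jumps over the lazy dog.",
]
# pad to the longest prompt in the batch; the attention mask is passed through to generate()
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
generated_ids = model.generate(**batch, max_new_tokens=64, no_repeat_ngram_size=3)
for text in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(text)
```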
## Quick eval
Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan`
`hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16)`, gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq | 2|none | 0|acc |↑ |0.7040|± |0.0080|
|openbookqa | 1|none | 0|acc |↑ |0.1580|± |0.0163|
| | |none | 0|acc_norm|↑ |0.2420|± |0.0192|
|piqa | 1|none | 0|acc |↑ |0.6132|± |0.0114|
| | |none | 0|acc_norm|↑ |0.6159|± |0.0113|
|social_iqa | 0|none | 0|acc |↑ |0.4319|± |0.0112|
|tinyArc | 0|none | 25|acc_norm|↑ |0.2898|± | N/A|
|tinyHellaswag| 0|none | 10|acc_norm|↑ |0.3295|± | N/A|
|tinyMMLU | 0|none | 0|acc_norm|↑ |0.2980|± | N/A|
|winogrande | 1|none | 0|acc |↑ |0.5020|± |0.0141|
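The table comes from the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); a minimal sketch of a comparable run via its Python API, assuming a recent `lm-eval` release where `simple_evaluate` accepts these arguments (few-shot counts come from the task defaults):
```py
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16",
    tasks=[
        "boolq", "openbookqa", "piqa", "social_iqa",
        "tinyArc", "tinyHellaswag", "tinyMMLU", "winogrande",
    ],
    batch_size=8,
)
print(results["results"])
```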
## Training and evaluation data
Trained on the `all` config of [pszemraj/flan-subsets-deduped](https://huggingface.co/datasets/pszemraj/flan-subsets-deduped).
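A minimal sketch for loading the same data with 🤗 Datasets:
```py
from datasets import load_dataset

# deduplicated FLAN subsets, config "all"
dataset = load_dataset("pszemraj/flan-subsets-deduped", "all")
print(dataset)
```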
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
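For reference, a sketch of how these settings map onto `Seq2SeqTrainingArguments` (the `output_dir` is a placeholder, the exact `optim` string is assumed to match the Trainer's naming, and the 2-GPU setup comes from the launcher rather than these arguments):
```py
from transformers import Seq2SeqTrainingArguments

# effective train batch size: 4 per device x 2 GPUs x 32 accumulation steps = 256
training_args = Seq2SeqTrainingArguments(
    output_dir="tFINE-680m-e32-d16-gqa-flan",  # placeholder
    learning_rate=8e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=32,
    seed=17868,
    optim="paged_ademamix_32bit",
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```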