---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1
tags:
- gqa
- t5
- instruct
datasets:
- pszemraj/infinity-instruct-7m-T2T_en
pipeline_tag: text2text-generation
---

# tFINE-680m-e32-d16-infinity_instruct-L2

This is an instruction-tuned version of a pretrained T5 model with GQA (grouped-query attention).

## Model description

This model is a fine-tuned version of [BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1](https://huggingface.co/BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1) on the pszemraj/infinity-instruct-7m-T2T_en dataset (config `deduped-L2`).
It achieves the following results on the evaluation set:
- Loss: 1.3139
- Num Input Tokens Seen: 361724696

## Usage

Prerequisite: you need the [t5-gqa fork of transformers](https://huggingface.co/BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan#testing) installed, along with `accelerate`.

```py
from transformers import pipeline

# requires the t5-gqa fork of transformers (see link above) and accelerate
pipe = pipeline(
    "text2text-generation",
    model="BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2",
    device_map="auto",
)

prompt = "Write me a python fn that demonstrates an advanced sorting algorithm"

res = pipe(
    prompt,
    max_new_tokens=384,
    num_beams=4,  # beam search with early stopping
    early_stopping=True,
    repetition_penalty=1.1,
)
print(res[0]["generated_text"])
```

## Quick eval

Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2`

hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq        |      2|none  |     0|acc     |↑  |0.6364|±  |0.0084|
|openbookqa   |      1|none  |     0|acc     |↑  |0.1480|±  |0.0159|
|             |       |none  |     0|acc_norm|↑  |0.2860|±  |0.0202|
|piqa         |      1|none  |     0|acc     |↑  |0.6083|±  |0.0114|
|             |       |none  |     0|acc_norm|↑  |0.6132|±  |0.0114|
|social_iqa   |      0|none  |     0|acc     |↑  |0.3854|±  |0.0110|
|tinyArc      |      0|none  |    25|acc_norm|↑  |0.3122|±  |   N/A|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.3356|±  |   N/A|
|tinyMMLU     |      0|none  |     0|acc_norm|↑  |0.2793|±  |   N/A|
|winogrande   |      1|none  |     0|acc     |↑  |0.5201|±  |0.0140|

A sketch for approximately reproducing this run appears at the end of this card.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a `Seq2SeqTrainingArguments` sketch reconstructing them follows the results table below):
- learning_rate: 2.5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 8
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.02
- num_epochs: 1.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:-----------------:|
| 1.4008        | 0.2534 | 1000 | 1.4020          | 91375832          |
| 1.3456        | 0.5068 | 2000 | 1.3669          | 182939052         |
| 1.3437        | 0.7602 | 3000 | 1.3378          | 274855796         |
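
As referenced above, here is a minimal sketch of how the logged training hyperparameters map onto `Seq2SeqTrainingArguments`. This is a reconstruction from the values listed in this card, not the original training script; dataset loading, preprocessing, the trainer setup, and the 2-GPU `accelerate` launch are omitted, and `output_dir` is a hypothetical path.

```py
# Minimal sketch reconstructed from the logged hyperparameters above;
# not the original training script.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="tFINE-680m-e32-d16-infinity_instruct-L2",  # hypothetical path
    learning_rate=2.5e-05,
    per_device_train_batch_size=4,  # 4 x 2 GPUs x 32 accumulation steps = 256 effective
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=32,
    seed=17868,
    optim="paged_ademamix_32bit",  # as logged; needs a recent transformers + bitsandbytes
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.02,
    num_train_epochs=1.0,
)
```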
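
Finally, a minimal sketch for approximately reproducing the quick eval above, assuming the `lm-evaluation-harness` Python API (v0.4.x, `pip install lm-eval`); the `tiny*` tasks additionally depend on the tinyBenchmarks package. The harness version originally used is not recorded, so results may differ slightly.

```py
# Minimal sketch, assuming lm-evaluation-harness v0.4.x;
# tasks and settings are taken from the eval header/table above.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2,"
        "trust_remote_code=True,dtype=bfloat16"
    ),
    tasks=[
        "boolq", "openbookqa", "piqa", "social_iqa",
        "tinyArc", "tinyHellaswag", "tinyMMLU", "winogrande",
    ],  # n-shot left at each task's default, matching num_fewshot: None above
    batch_size=8,
)
print(make_table(results))
```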