---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024
tags:
- flan
- t5
- gqa
- instruct
datasets:
- pszemraj/flan-subsets-deduped
---

# tFINE-680m-e32-d16-gqa-flan

FLAN-tuned variant of a tFINE (T5-architecture) model with grouped-query attention (GQA). Key dimensions (see the config-check sketch after this list):

- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
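These dimensions can be sanity-checked from the model config. A minimal sketch, assuming the GQA fork keeps the standard `T5Config` field names; the GQA-specific field name used below is a guess:

```py
from transformers import AutoConfig

# load the config from the Hub (requires the t5-gqa fork to be installed)
cfg = AutoConfig.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")

print(cfg.num_layers)          # encoder layers, expected 32
print(cfg.num_decoder_layers)  # decoder layers, expected 16
print(cfg.d_model)             # hidden size, expected 1024
# GQA key/value head count, if the fork exposes it under this (assumed) name
print(getattr(cfg, "num_key_value_heads", "not exposed"))
```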
## Testing

Install the [transformers fork with GQA updates for t5](https://github.com/pszemraj/transformers/tree/t5-gqa) (⚠️WIP🚧):
```sh
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```

Then run inference:
```py
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# load the FLAN-tuned GQA checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# greedy decoding with a simple repetition guard
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(
    tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
)
```
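If the fork registers the model with the standard auto classes, the same generation should also work through the high-level `pipeline` API; this is an untested sketch:

```py
from transformers import pipeline

# text2text-generation wraps tokenization, generate, and decoding in one call
pipe = pipeline(
    "text2text-generation",
    model="BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan",
)

out = pipe("What is the capital of France?", max_new_tokens=64, no_repeat_ngram_size=3)
print(out[0]["generated_text"])
```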
## Quick eval

Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan` (a hedged reproduction sketch with lm-evaluation-harness follows the table)

hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq | 2|none | 0|acc |↑ |0.7040|± |0.0080|
|openbookqa | 1|none | 0|acc |↑ |0.1580|± |0.0163|
| | |none | 0|acc_norm|↑ |0.2420|± |0.0192|
|piqa | 1|none | 0|acc |↑ |0.6132|± |0.0114|
| | |none | 0|acc_norm|↑ |0.6159|± |0.0113|
|social_iqa | 0|none | 0|acc |↑ |0.4319|± |0.0112|
|tinyArc | 0|none | 25|acc_norm|↑ |0.2898|± | N/A|
|tinyHellaswag| 0|none | 10|acc_norm|↑ |0.3295|± | N/A|
|tinyMMLU | 0|none | 0|acc_norm|↑ |0.2980|± | N/A|
|winogrande | 1|none | 0|acc |↑ |0.5020|± |0.0141|
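The header line above is the lm-evaluation-harness run spec. A rough, untested sketch of reproducing the zero-shot rows via the harness's Python API (assuming `lm-eval` ≥ 0.4 is installed; the tiny* tasks additionally need the tinyBenchmarks task set and their own few-shot settings):

```py
import lm_eval  # pip install lm-eval

# zero-shot subset of the table; scores may differ slightly across harness versions
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,"
        "dtype=bfloat16,trust_remote_code=True"
    ),
    tasks=["boolq", "openbookqa", "piqa", "social_iqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```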
## Training and evaluation data

Trained on the `all` config of the `pszemraj/flan-subsets-deduped` dataset.
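A minimal loading sketch with 🤗 Datasets, assuming the `all` config name is exposed as-is on the Hub:

```py
from datasets import load_dataset

# deduped FLAN subsets, "all" config (assumed config name from the note above)
ds = load_dataset("pszemraj/flan-subsets-deduped", "all")
print(ds)
```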
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (an approximate `Seq2SeqTrainingArguments` sketch follows the list):

- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
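For orientation only, the list above maps approximately onto `Seq2SeqTrainingArguments` as sketched below; `output_dir` is a placeholder, and anything not listed (precision, logging, saving) is left at its default. The `optim` value requires a transformers build with AdEMAMix support plus bitsandbytes:

```py
from transformers import Seq2SeqTrainingArguments

# approximate reconstruction of the reported hyperparameters;
# 4 per device x 2 GPUs x 32 accumulation steps = total train batch size 256
training_args = Seq2SeqTrainingArguments(
    output_dir="./tFINE-680m-e32-d16-gqa-flan",  # placeholder
    learning_rate=8e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=32,
    optim="paged_ademamix_32bit",
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
    seed=17868,
)
```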