
Stephen Oates PRO

soates

AI & ML interests

None yet

Recent Activity

upvoted a paper 24 days ago: Qwen2.5 Technical Report
liked a model about 1 month ago: Datou1111/shou_xin

Organizations

None yet

soates's activity

upvoted an article about 9 hours ago:
Mastering Tensor Dimensions in Transformers, by not-lain • 18
upvoted an article 4 months ago:
Fine-tuning LLMs to 1.58bit: extreme quantization made easy • 216
upvoted 2 articles 5 months ago:
Llama-3.1-Storm-8B: Improved SLM with Self-Curation + Model Merging, by akjindal53244 • 75
A failed experiment: Infini-Attention, and why we should keep trying? • 57
upvoted an article 7 months ago
reacted to BramVanroy's post with 👍 10 months ago:

Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory-heavy than Mistral 7B. I know that its vocabulary is much larger (250k), but I'm a bit surprised that the max batch size I can get on an A100 80GB is only 2, whereas I could fit 4 with Mistral 7B, even though Gemma is much smaller except for the embedding layer. Both runs were using FlashAttention, the same sequence length, and the same DeepSpeed ZeRO-3 settings. Oh, and yes, I'm using the most recent hotfix of transformers that solves a memory issue with Gemma and others.

Any prior experience that you can share, or suggestions to improve throughput?
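A back-of-envelope calculation suggests where much of the gap comes from. The post mentions Gemma's ~250k vocabulary; assuming the publicly documented shapes (Gemma 2B: vocab ~256,000, hidden size 2048; Mistral 7B: vocab 32,000, hidden size 4096, both approximate), the embedding matrix and, more importantly, the per-step logits tensor (batch × sequence × vocab) are far larger for Gemma. This is an illustrative sketch, not an exact memory profile:

```python
# Rough memory comparison: Gemma 2B vs Mistral 7B.
# Assumed shapes (approximate, from public model configs):
#   Gemma 2B:   vocab_size ~256,000, hidden_size 2048
#   Mistral 7B: vocab_size 32,000,   hidden_size 4096

BYTES_BF16 = 2  # bf16/fp16 weights
BYTES_FP32 = 4  # logits are often upcast to fp32 for the loss

def embedding_bytes(vocab_size: int, hidden_size: int) -> int:
    """Size of one embedding matrix at 2 bytes per parameter."""
    return vocab_size * hidden_size * BYTES_BF16

def logits_bytes(batch: int, seq_len: int, vocab_size: int) -> int:
    """Size of one fp32 logits tensor for a single forward pass."""
    return batch * seq_len * vocab_size * BYTES_FP32

# Embedding weights (sharded under ZeRO-3, so less of an issue per GPU):
print(f"Gemma 2B embedding:   {embedding_bytes(256_000, 2048) / 2**30:.2f} GiB")  # 0.98 GiB
print(f"Mistral 7B embedding: {embedding_bytes(32_000, 4096) / 2**30:.2f} GiB")   # 0.24 GiB

# Logits activation for batch 4, seq len 2048 (not sharded by ZeRO-3):
print(f"Gemma logits:   {logits_bytes(4, 2048, 256_000) / 2**30:.2f} GiB")  # 7.81 GiB
print(f"Mistral logits: {logits_bytes(4, 2048, 32_000) / 2**30:.2f} GiB")   # 0.98 GiB
```

Since ZeRO-3 shards parameters but not per-step activations, the fp32 logits tensor alone could plausibly account for a large share of the batch-size difference the post describes.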
  • 4 replies