Over the weekend, after a failed initial run, I got excited by Pete's success with Jamba tuning and decided to throw a little compute at a similar-sized dataset (the main shisa-v1 bilingual tuning set).
As with my initial runs, the training graphs looked fine, but the results were less than spectacular.
Here are the JA MT-Bench evals for the 2416 checkpoint (eval/loss plateau) and the 4228 (3 epoch) tune:

| Model | JA MT-Bench score |
| --- | --- |
| shisa-jamba-v1-checkpoint-2416 | 2.491525 |
| shisa-jamba-v1-checkpoint-4228 | 2.508475 |
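
For anyone wanting to sanity-check numbers like these, here is a rough sketch of how a single per-model score can be averaged from individual judge scores. It is not the eval code from the repo: the record layout (a `model` field and a numeric `score` field, with `-1` marking a judgment that failed to parse) is an assumption, and the sample records are made up.

```python
from collections import defaultdict


def mean_scores(judgments):
    """Average judge scores per model, skipping failed judgments.

    Assumes each record is a dict with "model" and "score" keys and that
    a score of -1 marks a judgment that could not be parsed (assumption).
    """
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, count]
    for row in judgments:
        if row["score"] == -1:
            continue  # ignore judgments that produced no usable score
        entry = totals[row["model"]]
        entry[0] += row["score"]
        entry[1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Made-up records, for illustration only (not real eval output).
sample = [
    {"model": "model-a", "score": 2},
    {"model": "model-a", "score": 3},
    {"model": "model-a", "score": -1},
    {"model": "model-b", "score": 4},
]
averages = mean_scores(sample)
```

With real judge output, the records would be loaded from the judgment file first and passed in the same way.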
You can view the answers in the repo (lots of repetitions and nonsense) and compare to proper JA MT-Bench scores from my testing.
While this was an "unsuccessful" experiment, it was still worth the practice, although I got a little overexcited and obviously should have gone with my more typical, lighter testing.
This kicks off the official shisa-v2 base model evaluation. I was a bit hesitant about throwing this model out there (since it's useless as an artifact), but since I've actually made the in-process code available while working on it, I'll share this as well just in case (and to do this writeup).
Here is the current full code/steps for Axolotl training and eval (modified llm-judge inferencing code):
- https://github.com/shisa-ai/shisa-v2/tree/main/_base-evals/jamba/axolotl
- https://github.com/shisa-ai/shisa-v2/tree/main/_base-evals/jamba/eval
Thanks to Pete for the useful initial report and the axolotl team for their fast integration of Jamba (way better than my raw tune code).