Loading tokenizer: /tmp/eval/multilingual_32k.model
Loading model: /tmp/eval/best_model.pt
Model loaded: 3.04B parameters on cuda
============================================================
BELEBELE EVALUATION - Multilingual 3B GPT
============================================================
Evaluating EN (eng_Latn)...
[EN] 50/900 - accuracy so far: 22.0%
[EN] 100/900 - accuracy so far: 29.0%
[EN] 150/900 - accuracy so far: 32.7%
[EN] 200/900 - accuracy so far: 30.5%
[EN] 250/900 - accuracy so far: 32.0%
[EN] 300/900 - accuracy so far: 31.7%
[EN] 350/900 - accuracy so far: 32.9%
[EN] 400/900 - accuracy so far: 33.5%
[EN] 450/900 - accuracy so far: 32.9%
[EN] 500/900 - accuracy so far: 31.4%
[EN] 550/900 - accuracy so far: 32.0%
[EN] 600/900 - accuracy so far: 32.2%
[EN] 650/900 - accuracy so far: 32.6%
[EN] 700/900 - accuracy so far: 32.3%
[EN] 750/900 - accuracy so far: 32.7%
[EN] 800/900 - accuracy so far: 32.0%
[EN] 850/900 - accuracy so far: 31.9%
[EN] 900/900 - accuracy so far: 31.8%
✅ EN: 31.8% (286/900)
Evaluating HE (heb_Hebr)...
[HE] 50/900 - accuracy so far: 20.0%
[HE] 100/900 - accuracy so far: 25.0%
[HE] 150/900 - accuracy so far: 24.0%
[HE] 200/900 - accuracy so far: 26.0%
[HE] 250/900 - accuracy so far: 25.2%
[HE] 300/900 - accuracy so far: 25.7%
[HE] 350/900 - accuracy so far: 24.9%
[HE] 400/900 - accuracy so far: 24.8%
[HE] 450/900 - accuracy so far: 24.9%
[HE] 500/900 - accuracy so far: 24.2%
[HE] 550/900 - accuracy so far: 25.1%
[HE] 600/900 - accuracy so far: 25.3%
[HE] 650/900 - accuracy so far: 25.7%
[HE] 700/900 - accuracy so far: 25.4%
[HE] 750/900 - accuracy so far: 25.9%
[HE] 800/900 - accuracy so far: 26.2%
[HE] 850/900 - accuracy so far: 26.8%
[HE] 900/900 - accuracy so far: 27.0%
✅ HE: 27.0% (243/900)
Evaluating AR (arb_Arab)...
[AR] 50/900 - accuracy so far: 28.0%
[AR] 100/900 - accuracy so far: 25.0%
[AR] 150/900 - accuracy so far: 24.7%
[AR] 200/900 - accuracy so far: 29.5%
[AR] 250/900 - accuracy so far: 30.8%
[AR] 300/900 - accuracy so far: 30.0%
[AR] 350/900 - accuracy so far: 28.6%
[AR] 400/900 - accuracy so far: 28.2%
[AR] 450/900 - accuracy so far: 28.7%
[AR] 500/900 - accuracy so far: 27.2%
[AR] 550/900 - accuracy so far: 27.5%
[AR] 600/900 - accuracy so far: 27.0%
[AR] 650/900 - accuracy so far: 27.7%
[AR] 700/900 - accuracy so far: 28.1%
[AR] 750/900 - accuracy so far: 28.9%
[AR] 800/900 - accuracy so far: 29.1%
[AR] 850/900 - accuracy so far: 28.9%
[AR] 900/900 - accuracy so far: 28.4%
✅ AR: 28.4% (256/900)
Evaluating FA (pes_Arab)...
[FA] 50/900 - accuracy so far: 32.0%
[FA] 100/900 - accuracy so far: 33.0%
[FA] 150/900 - accuracy so far: 30.7%
[FA] 200/900 - accuracy so far: 30.5%
[FA] 250/900 - accuracy so far: 28.4%
[FA] 300/900 - accuracy so far: 29.0%
[FA] 350/900 - accuracy so far: 29.4%
[FA] 400/900 - accuracy so far: 30.0%
[FA] 450/900 - accuracy so far: 30.2%
[FA] 500/900 - accuracy so far: 30.8%
[FA] 550/900 - accuracy so far: 30.7%
[FA] 600/900 - accuracy so far: 30.3%
[FA] 650/900 - accuracy so far: 29.5%
[FA] 700/900 - accuracy so far: 29.4%
[FA] 750/900 - accuracy so far: 28.4%
[FA] 800/900 - accuracy so far: 27.8%
[FA] 850/900 - accuracy so far: 28.0%
[FA] 900/900 - accuracy so far: 28.2%
✅ FA: 28.2% (254/900)
============================================================
OVERALL: 28.9% (1039/3600)
Random baseline: 25.0%
============================================================
Results saved to /tmp/eval/belebele_results.json
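
The OVERALL figure in the log can be reproduced from the four per-language counts. The sketch below is not part of the original evaluation script; the names `counts`, `accuracy`, and `z_vs_baseline` are hypothetical. It recomputes the pooled accuracy and, as an added illustration, a normal-approximation z-score of each language against the 25% random baseline the log reports.

```python
import math

# Per-language (correct, total) counts copied from the log's summary lines.
counts = {"EN": (286, 900), "HE": (243, 900), "AR": (256, 900), "FA": (254, 900)}
BASELINE = 0.25  # random baseline as reported in the log


def accuracy(correct, total):
    """Fraction of questions answered correctly."""
    return correct / total


def z_vs_baseline(correct, total, p0=BASELINE):
    """Normal-approximation z-score of observed accuracy against chance p0.

    This statistic is an illustrative assumption, not something the log computes.
    """
    se = math.sqrt(p0 * (1 - p0) / total)
    return (correct / total - p0) / se


# Pool the counts across languages, as the OVERALL line does.
total_correct = sum(c for c, _ in counts.values())
total_items = sum(n for _, n in counts.values())
overall = accuracy(total_correct, total_items)

for lang, (c, n) in counts.items():
    print(f"{lang}: {accuracy(c, n):.1%}  z={z_vs_baseline(c, n):.2f}")
print(f"OVERALL: {overall:.1%} ({total_correct}/{total_items})")
```

The pooled counts come to 1039/3600, matching the OVERALL line. The z-scores give a rough sense of how far each language sits above chance at this sample size.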