| Loading tokenizer: /tmp/eval/multilingual_32k.model |
| Loading model: /tmp/eval/best_model.pt |
| Model loaded: 3.04B parameters on cuda |
|
|
| ============================================================ |
| BELEBELE EVALUATION - Multilingual 3B GPT |
| ============================================================ |
|
|
|
|
| Evaluating EN (eng_Latn)... |
| [EN] 50/900 - accuracy so far: 22.0 |
| [EN] 100/900 - accuracy so far: 29.0 |
| [EN] 150/900 - accuracy so far: 32.7 |
| [EN] 200/900 - accuracy so far: 30.5 |
| [EN] 250/900 - accuracy so far: 32.0 |
| [EN] 300/900 - accuracy so far: 31.7 |
| [EN] 350/900 - accuracy so far: 32.9 |
| [EN] 400/900 - accuracy so far: 33.5 |
| [EN] 450/900 - accuracy so far: 32.9 |
| [EN] 500/900 - accuracy so far: 31.4 |
| [EN] 550/900 - accuracy so far: 32.0 |
| [EN] 600/900 - accuracy so far: 32.2 |
| [EN] 650/900 - accuracy so far: 32.6 |
| [EN] 700/900 - accuracy so far: 32.3 |
| [EN] 750/900 - accuracy so far: 32.7 |
| [EN] 800/900 - accuracy so far: 32.0 |
| [EN] 850/900 - accuracy so far: 31.9 |
| [EN] 900/900 - accuracy so far: 31.8 |
| ✅ EN: 31.8 |
|
|
| Evaluating HE (heb_Hebr)... |
| [HE] 50/900 - accuracy so far: 20.0 |
| [HE] 100/900 - accuracy so far: 25.0 |
| [HE] 150/900 - accuracy so far: 24.0 |
| [HE] 200/900 - accuracy so far: 26.0 |
| [HE] 250/900 - accuracy so far: 25.2 |
| [HE] 300/900 - accuracy so far: 25.7 |
| [HE] 350/900 - accuracy so far: 24.9 |
| [HE] 400/900 - accuracy so far: 24.8 |
| [HE] 450/900 - accuracy so far: 24.9 |
| [HE] 500/900 - accuracy so far: 24.2 |
| [HE] 550/900 - accuracy so far: 25.1 |
| [HE] 600/900 - accuracy so far: 25.3 |
| [HE] 650/900 - accuracy so far: 25.7 |
| [HE] 700/900 - accuracy so far: 25.4 |
| [HE] 750/900 - accuracy so far: 25.9 |
| [HE] 800/900 - accuracy so far: 26.2 |
| [HE] 850/900 - accuracy so far: 26.8 |
| [HE] 900/900 - accuracy so far: 27.0 |
| ✅ HE: 27.0 |
|
|
| Evaluating AR (arb_Arab)... |
| [AR] 50/900 - accuracy so far: 28.0 |
| [AR] 100/900 - accuracy so far: 25.0 |
| [AR] 150/900 - accuracy so far: 24.7 |
| [AR] 200/900 - accuracy so far: 29.5 |
| [AR] 250/900 - accuracy so far: 30.8 |
| [AR] 300/900 - accuracy so far: 30.0 |
| [AR] 350/900 - accuracy so far: 28.6 |
| [AR] 400/900 - accuracy so far: 28.2 |
| [AR] 450/900 - accuracy so far: 28.7 |
| [AR] 500/900 - accuracy so far: 27.2 |
| [AR] 550/900 - accuracy so far: 27.5 |
| [AR] 600/900 - accuracy so far: 27.0 |
| [AR] 650/900 - accuracy so far: 27.7 |
| [AR] 700/900 - accuracy so far: 28.1 |
| [AR] 750/900 - accuracy so far: 28.9 |
| [AR] 800/900 - accuracy so far: 29.1 |
| [AR] 850/900 - accuracy so far: 28.9 |
| [AR] 900/900 - accuracy so far: 28.4 |
| ✅ AR: 28.4 |
|
|
| Evaluating FA (pes_Arab)... |
| [FA] 50/900 - accuracy so far: 32.0 |
| [FA] 100/900 - accuracy so far: 33.0 |
| [FA] 150/900 - accuracy so far: 30.7 |
| [FA] 200/900 - accuracy so far: 30.5 |
| [FA] 250/900 - accuracy so far: 28.4 |
| [FA] 300/900 - accuracy so far: 29.0 |
| [FA] 350/900 - accuracy so far: 29.4 |
| [FA] 400/900 - accuracy so far: 30.0 |
| [FA] 450/900 - accuracy so far: 30.2 |
| [FA] 500/900 - accuracy so far: 30.8 |
| [FA] 550/900 - accuracy so far: 30.7 |
| [FA] 600/900 - accuracy so far: 30.3 |
| [FA] 650/900 - accuracy so far: 29.5 |
| [FA] 700/900 - accuracy so far: 29.4 |
| [FA] 750/900 - accuracy so far: 28.4 |
| [FA] 800/900 - accuracy so far: 27.8 |
| [FA] 850/900 - accuracy so far: 28.0 |
| [FA] 900/900 - accuracy so far: 28.2 |
| ✅ FA: 28.2 |
|
|
| ============================================================ |
| OVERALL: 28.9 |
| Random baseline: 25.0 |
| ============================================================ |
|
|
| Results saved to /tmp/eval/belebele_results.json |
|
|