Same as Voicelab/trurl-2-13b, which was flagged (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/202): the MMLU score is way too high for a 13B model.

Can it be flagged?

Good catch! Since it contains trurl-13b in the name, it likely used the above model as a base, so I'm flagging it for the moment.
However, for the sake of fairness, could you open an issue on their model repo to ask what they trained on/used as base?

The model file sizes seem consistent with other 13b models. Can users rewrite history by force-pushing to model repos?

Further clarification for anyone (like me) who missed the Voicelab discussion: the trurl-2-13b model's training data included much of the MMLU test set, so of course it scores exceedingly well on that test for a 13b model. The Voicelab team is re-training without the MMLU dataset but doesn't expect much difference from base llama-2-13b; their focus is on Polish knowledge.

