Add size and language to table?

#15
by endolith - opened

Would it make sense to add model size and languages to the table? Some say "multilingual" in the name but it's not clear about the ones that don't. (Column could say "English" if English-only, "100" if it was trained on 100 languages, etc.)

And if we're running them locally, it would be good to know that a 438 MB model scores almost as well as a 4.96 GB model when deciding what to try.

Massive Text Embedding Benchmark org
edited Jul 11, 2023

Good ideas!

  1. Language - The problem is that it's ambiguous, e.g. LASER2 is supposed to be multilingual across hundreds of languages, but it's often outperformed by English "monolingual" models; OpenAI doesn't even specify the languages for text-embedding-ada-002 & I imagine it works for other Latin languages too. One could add a Num Languages column that simply counts the languages listed in the model's metadata (e.g. for https://huggingface.co/intfloat/multilingual-e5-large); for multilingual-e5-large it would be ~85 & for ada-002 it would be 1, but again I think it would be quite noisy.
    Best would be to just benchmark these models on all the languages & have an average across multiple languages not just English! Am working on adding additional average scores.

  2. Size - Adding the file size or parameter count would be great; the problem here is just how to get it automatically from a model on HF. If you know how to do it, please let me know! I think for parameter count, safetensors would be one way (https://huggingface.co/docs/safetensors/metadata_parsing) but many models don't have ST weights, so we would still need a manual map for those. A manual map is fine too imo as the leaderboard doesn't change too often - Feel free to create one via PR / here if you want to!
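
For models that do ship safetensors weights, the parameter count can be read from the file header without downloading the weights. The following is a minimal sketch, assuming a single unsharded safetensors file and the header layout described in the safetensors docs linked above (an 8-byte little-endian length followed by a JSON header). The helper names are hypothetical and this is not what the leaderboard itself does:

```python
import json
import math
import struct
import urllib.request


def parse_safetensors_header(blob: bytes) -> dict:
    """Decode the JSON header from the first bytes of a safetensors file."""
    # The first 8 bytes hold the header length as an unsigned little-endian int.
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len])


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors listed in the header."""
    return sum(
        math.prod(info["shape"])  # an empty shape (scalar) counts as 1
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


def fetch_header(url: str) -> dict:
    """Fetch only the header of a remote file using two HTTP range requests.

    Assumes the server honours Range headers; not exercised offline.
    """
    req = urllib.request.Request(url, headers={"Range": "bytes=0-7"})
    with urllib.request.urlopen(req) as resp:
        (header_len,) = struct.unpack("<Q", resp.read(8))
    req = urllib.request.Request(
        url, headers={"Range": f"bytes=8-{7 + header_len}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Something like `count_parameters(fetch_header(url))`, with `url` pointing at a model's `model.safetensors` file, would then give the count. Sharded checkpoints would need the counts summed per shard, and models without safetensors weights would still need the manual map.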

"but it's often outperformed by English "monolingual" models"

Outperformed on multilingual tasks, though?

"Best would be to just benchmark these models on all the languages & have an average across multiple languages not just English!"

Good point! Benchmarking what it can actually do is indeed better than listing what it is meant to do.

"If you know how to do it, please let me know!"

It's not just the size of the pytorch_model.bin file? Sorry, I'm not an expert. I was thinking that's a rough proxy for how much memory it will take up on my machine and how quickly it will run.

Oh, (some?) model cards have language tags at the top, like https://huggingface.co/intfloat/multilingual-e5-large has "94 languages" linked to https://huggingface.co/models?language=multilingual while https://huggingface.co/intfloat/e5-large has "English" with https://huggingface.co/models?language=en
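
Counting those language tags could be automated roughly as follows. This is only a sketch: it assumes the Hub's model-info JSON exposes the model card metadata under a `cardData` key with a `language` entry (a string or a list), and the function names and the default `ignore` set are made up for illustration:

```python
import json
import urllib.request


def count_languages(model_info: dict, ignore=("multilingual",)) -> int:
    """Count distinct language tags in a model-info dict.

    Assumes the tags live at model_info["cardData"]["language"].
    """
    card = model_info.get("cardData") or {}
    langs = card.get("language") or []
    if isinstance(langs, str):  # a single language may be a plain string
        langs = [langs]
    # Drop catch-all tags that are not actual languages.
    return len({lang for lang in langs if lang not in ignore})


def fetch_model_info(model_id: str) -> dict:
    """Fetch model info from the Hub (assumed endpoint; needs network)."""
    url = f"https://huggingface.co/api/models/{model_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Models with no language metadata would come out as 0 here, which is one reason such a column could be noisy.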

(But, yes, actual multilingual performance would be better to include.)

Massive Text Embedding Benchmark org

"It's not just the size of the pytorch_model.bin file? Sorry, I'm not an expert. I was thinking that's a rough proxy for how much memory it will take up on my machine and how quickly it will run."

Yeah, the size of the pytorch_model.bin would be a good proxy, but I don't know if it's possible to get that automatically without downloading the bin file?
Otherwise we could download each model's bin file, check the size, then delete it, or we'd have to do a manual map and go through each model.

λ git clone --no-checkout https://huggingface.co/intfloat/multilingual-e5-large
λ cd multilingual-e5-large
λ git lfs ls-files -s
020afdebf2 - model.safetensors (2.2 GB)
bb5a52503a - onnx/model.onnx (546 KB)
0cf1883fee - onnx/model.onnx_data (2.2 GB)
cfc8146abe - onnx/sentencepiece.bpe.model (5.1 MB)
62c24cdc13 - onnx/tokenizer.json (17 MB)
9aaa222c5a - pytorch_model.bin (2.2 GB)
cfc8146abe - sentencepiece.bpe.model (5.1 MB)
62c24cdc13 - tokenizer.json (17 MB)

Even better:

λ git lfs ls-files -s --include=pytorch_model.bin
9aaa222c5a - pytorch_model.bin (2.2 GB)
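
To feed these sizes into a table, the output above could be parsed with something like the sketch below. It assumes the line format shown in this thread and that the units are decimal (1 GB = 1000^3 bytes), which I haven't verified; the sizes are rounded by git lfs anyway, so the result is approximate. `parse_lfs_sizes` is a made-up helper name:

```python
import re
from decimal import Decimal

# Assumption: git lfs prints decimal (SI) units.
_UNITS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}

# Matches lines such as "9aaa222c5a - pytorch_model.bin (2.2 GB)".
# The marker after the hash is "-" or "*".
_LINE = re.compile(
    r"[0-9a-f]+ [-*] (?P<name>.+) \((?P<size>[\d.]+) (?P<unit>[KMGT]?B)\)"
)


def parse_lfs_sizes(output: str) -> dict:
    """Map file name -> approximate size in bytes from `git lfs ls-files -s` output."""
    sizes = {}
    for line in output.splitlines():
        match = _LINE.fullmatch(line.strip())
        if match:  # prompts and other non-matching lines are skipped
            sizes[match["name"]] = int(
                Decimal(match["size"]) * _UNITS[match["unit"]]
            )
    return sizes
```

The output text could come from running the command with `subprocess.run(..., capture_output=True, text=True)` inside each cloned repo.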

(Thanks, ChatGPT and Stack Overflow!)

Massive Text Embedding Benchmark org

Nice! Added 👍☀️ Lmk if you'd do sth differently
