Clarification about multilingual training data

#1
by bofenghuang - opened

Hi,

Thank you for this excellent work and for publishing the models!

From what I've understood in the paper, the phi-3-medium model uses the same tokenizer (32k) as phi-3-mini and is trained on the same data for slightly more epochs, which is intended to test the data mixture at larger model sizes. Does that mean this model is supposed to be English-only, like the mini model?

However, in the model card I noticed that this model is tagged as multilingual, the Datasets section mentions "(including 10% multilingual)", and there is an evaluation on a "Multilingual" benchmark. There seems to be something I've missed, possibly related to differences in fine-tuning. Could you please clarify this? Any additional details would be greatly appreciated!

Additionally, the benchmark results in the Math and Multilingual categories do not seem to be consistent between the small and medium models.

In microsoft/Phi-3-small-8k-instruct:

[screenshot of the Math and Multilingual benchmark rows]

In microsoft/Phi-3-medium-4k-instruct:

[screenshot of the Math and Multilingual benchmark rows]

Microsoft org

Thank you for your interest! The intended use for the Phi-3 model family is English. For Small and Medium we included some multilingual data, but the models are optimized for English tasks.
We have released the model weights and are very eager to learn more about the model quality in your use cases.
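For anyone who wants to spot-check multilingual quality themselves, here is a minimal sketch using the Hugging Face transformers chat API. The checkpoint id is taken from the model card; the French prompt, dtype, and generation settings are just illustrative assumptions, not an official evaluation setup.

```python
# Minimal sketch: probe Phi-3-medium on a non-English prompt.
# Assumptions: standard transformers AutoModelForCausalLM / AutoTokenizer usage;
# prompt and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-medium-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# A French instruction to spot-check multilingual behaviour.
messages = [{"role": "user", "content": "Explique brièvement la photosynthèse."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```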

nguyenbh changed discussion status to closed
