The Chinese training data of the model is contaminated

#165
by bookwoods123 - opened

I have tested many long audio recordings, each over half an hour long. The transcribed text contains many phrases like the following, which are not present in the original audio:
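请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and tip to support the Mingjing and Diandian programs")
字幕志愿者 杨茜茜优优独播剧场 ("Subtitle volunteer Yang Qianqian", "Youyou Exclusive Theater")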

This occurs with both openai/whisper-large-v3 and openai/whisper-large-v3-turbo. I am certain that my audio files do not contain these words.

This is ridiculous.
