The Chinese training data of the model is contaminated

#165

by bookwoods123 - opened Oct 22, 2024

Discussion

bookwoods123

Oct 22, 2024

•

edited Oct 22, 2024

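I have tested many long audio recordings, each over half an hour long. The transcribed text contains many occurrences of the following phrases, which are not present in the original audio:
请不吝点赞订阅转发打赏支持明镜与点点栏目 ("Please don't hesitate to like, subscribe, share, and tip to support the Mingjing and Diandian programs")
字幕志愿者杨茜茜优优独播剧场 ("Subtitle volunteer Yang Qianqian, Youyou Exclusive Theater")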

This happens with both openai/whisper-large-v3 and openai/whisper-large-v3-turbo. I am certain that my audio does not contain these words.
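
As a stopgap, the two phrases reported above can be detected and stripped from a transcript after the fact. The following is only a minimal sketch in plain Python: the function names are hypothetical, the phrase list contains just the two strings from this post, and exact substring matching is an assumption that will miss any variant wording the model produces. It does not address the underlying cause.

```python
# Phrases reported in this thread as appearing in the transcript
# even though they are not in the audio. Extend as needed.
REPORTED_PHRASES = [
    "请不吝点赞订阅转发打赏支持明镜与点点栏目",
    "字幕志愿者杨茜茜优优独播剧场",
]


def find_reported_phrases(text, phrases=REPORTED_PHRASES):
    """Return (phrase, count) pairs for each listed phrase found in text."""
    return [(p, text.count(p)) for p in phrases if p in text]


def strip_reported_phrases(text, phrases=REPORTED_PHRASES):
    """Remove every exact occurrence of the listed phrases from text."""
    for p in phrases:
        text = text.replace(p, "")
    return text


if __name__ == "__main__":
    # Hypothetical transcript with one inserted phrase.
    transcript = "今天天气很好请不吝点赞订阅转发打赏支持明镜与点点栏目我们继续"
    print(find_reported_phrases(transcript))
    print(strip_reported_phrases(transcript))
```

Counting occurrences before stripping gives a rough measure of how often the problem shows up in a given recording.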

Zeyu0601

Oct 25, 2024

This is outrageous.

varanium

Nov 23, 2024

Same here.
