Mandarin to English and English to Mandarin

#25
by skoll520 - opened

I'm not a Mandarin speaker, but it seems like M4T cuts the text in the middle when translating. It also translates "Penrose" as an <.unk.> token.
P.S.: I had to insert periods in the token or Hugging Face doesn't show it, which is very strange.

"In the realm of theoretical astrophysics, the study of cosmic singularities poses profound challenges that transcend our current comprehension of the universe. These gravitational anomalies, as predicted by Einstein's field equations, are encapsulated within event horizons, where the very fabric of spacetime is distorted beyond recognition. The singularity theorems of Hawking and Penrose have offered profound insights into the nature of black holes, suggesting that at their core lies a point of infinite density, known as a gravitational singularity. The implications of these singularities extend into the heart of our understanding of physics and cosmology, beckoning us to reconcile general relativity with quantum mechanics in a unified theory of everything. The enigma of singularities challenges our fundamental understanding of the cosmos, pushing the boundaries of human knowledge to the very limits of existence."

"在理论天体物理学领域,宇宙奇点的研究提出了超越我们目前对宇宙的理解的深刻挑战.这些引力异常,正如爱因斯坦的场方程所预测的,被封装在事件地平线内,在那里,时空的结构被扭曲到无法识别的地步. 霍金和<.unk.>罗斯的奇点定理为黑洞的本质提供了深刻的见解,表明它们的核心是一个无限密度的点,称为引力奇点. 这些奇点的含义延伸到我们对物理学和宇宙学的理解的核心,促使我们在一切的统一理论中调和一般相对论与量子力学. 奇点的谜题挑战了我们对宇宙的基本理解,将人类知识的极限推向了存在的极限."

cc @ylacombe - is the max length being hit here, maybe? Should we display a warning?

Let me take a look!

At the moment, this is still using the original implementation, which uses a soft max length. The formula for the max length is: min(1024, input_token_length + 200).
I would say the model uses more tokens for Mandarin than for English, which explains why it cuts off mid-sentence.
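For reference, here is a minimal sketch of that soft max-length rule, just the formula above expressed in Python (the function name is illustrative, not the actual variable from the original implementation):

```python
# Soft max length from the original implementation, as described above:
# the output is capped at min(1024, input_token_length + 200) tokens.
def soft_max_length(input_token_length: int) -> int:
    return min(1024, input_token_length + 200)

# e.g. a 300-token English input allows at most 500 output tokens.
# If Mandarin needs more tokens per sentence than English, a long
# translation can exhaust this budget and stop mid-sentence.
print(soft_max_length(300))  # 500
```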

Are you using the HF implementation or this Space?

> P.S.: I had to insert periods in the token or Hugging Face doesn't show it, which is very strange.

By default, the SeamlessM4T tokenizer skips special tokens when decoding. <unk> is a special token, so it is skipped!
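If you use the HF transformers implementation, you can make <unk> visible by passing skip_special_tokens=False when decoding. A minimal sketch, assuming a text-to-text setup (the checkpoint name, the example sentence, and the "cmn" target code are illustrative assumptions):

```python
from transformers import AutoProcessor, SeamlessM4TForTextToText

ckpt = "facebook/hf-seamless-m4t-medium"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = SeamlessM4TForTextToText.from_pretrained(ckpt)

text = "The singularity theorems of Hawking and Penrose."
inputs = processor(text=text, src_lang="eng", return_tensors="pt")
output_ids = model.generate(**inputs, tgt_lang="cmn")

# Default decoding drops special tokens, so <unk> silently disappears;
# skip_special_tokens=False keeps it visible in the decoded string.
print(processor.decode(output_ids[0], skip_special_tokens=True))
print(processor.decode(output_ids[0], skip_special_tokens=False))
```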

Let me know if you have any further questions.

I used this Space. Also, my P.S. is about this site itself, not the Space or the model. If you try removing the periods from this: <.unk.> and leave a comment, it disappears.

Thanks, so my comment should be the right answer then!
