My results have unreadable data in them; how to resolve
#12, opened by richylyq
I am using this model for English-to-Chinese translation, but the output comes with a large amount of unreadable data mixed in. How do I clean it up and get a clean string of Chinese characters?
[{'translation_text': '{\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的小苹果 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的小苹果 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你 是我的 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的小苹果 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080}你是我的 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080} - - - {\\fn黑体 {\\fn黑体 {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE - - {\\fn黑体\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080} - - {\\fn黑体\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080} - - {\\fn黑体 - {\\fnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfn黑体\\shad0\\3aHBE\\4aH00\\shad0\\3aHBE\\4aH00\\shad0\\3aHBE\\4aH00\\shad0\\3aHBE\\shad0\\3aHBE\\fs22\\bord1\\shad0\\3aHBE\\4aH00\\fscx\\4aH00\\fscx67\\bord1\\shad0\\3aHBE\\4aH00\\fscx\\4aH00\\shad0\\3aHBE\\4aH00\\fscx6aH00\\fscx\\4aH00\\fscx6aH00\\fscx6aH00\\fscx67\\fscy\\4aH00\\fscx6aH00\\fscx67\\fscy66\\2cHFFFFFF\\3cH808080} {\\fnfnfnfnfn黑体\\fscx6aH00\\fscx6aH00\\fscx6aH00\\fscx6aH00\\fscx6aH00\\fscx6aH00\\fscx67\\fscy\\4aH00\\fscx67\\fscy\\4aH00\\fscx67\\fscy\\4aH00\\fscx67\\fscy\\4aH00\\fscx67\\fscy\\4aH00\\fscx67\\fscy\\4aH\\4aH00\\fscx67\\fscy\\4aH'}]
Check this link: https://huggingface.co/docs/transformers/model_doc/marian
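from transformers import MarianMTModel, MarianTokenizer

# English sentences to translate
src_text = [
    'Hello, Good to see you.',
    "It's a beautiful day!",
    'Good moods are the most important.',
]

# Load the English-to-Chinese checkpoint and its tokenizer
model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode the token IDs back to text, dropping special tokens such as padding
res = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(res)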
The result is:
['你好,很高兴见到你。', '这是一个美丽的一天!', '良好的情绪是最重要的。']
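If the tag blocks still show up for some inputs, one option is to strip them in a post-processing step. The following is only a minimal sketch, assuming the debris consists of ASS/SSA-style subtitle override blocks like the `{\fn黑体\fs22...}` runs in the output posted above. The names `TAG_BLOCK`, `clean_translation`, and `clean_pipeline_output` are made up for this example and are not part of transformers.

```python
import re

# Assumption: the debris consists of ASS/SSA-style override blocks such as
# {\fn黑体\fs22\bord1...}. A block starts with "{\" and runs to the closing "}",
# or to the next whitespace when the model never closed it.
TAG_BLOCK = re.compile(r"\{\\[^\s{}]*\}?")


def clean_translation(text: str) -> str:
    """Remove override blocks and collapse the whitespace left behind."""
    without_tags = TAG_BLOCK.sub(" ", text)
    return " ".join(without_tags.split())


def clean_pipeline_output(results):
    """Apply clean_translation to a list shaped like [{'translation_text': ...}]."""
    return [clean_translation(item["translation_text"]) for item in results]


# Shortened sample in the same shape as the output posted above.
raw = [{"translation_text": r"{\fn黑体\fs22\bord1\shad0}你是我的 {\fn黑体\fs22}你是我的小苹果 {\fn黑体"}]
print(clean_pipeline_output(raw))
```

This only removes the tag blocks. It does not remove repeated phrases or the stray " - " separators, and it would also delete any legitimate text that happens to start with `{\`, so check it against your own data before relying on it.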