關於MT Bench Result

#3
by cllatMTK - opened

請問INX-TEXT是如何進行MT Bench評估的呢? System prompt、以及評分方式又是如何呢?
Breeze-7B的結果與模型頁面寫的不相符

INX-TEXT-AI org
edited Feb 16

@cllatMTK

如我們在Model card的Model performance一欄所述,我們使用LLM-as-a-judge,以GPT4作為裁判,對模型的回答做評分。跟原本MT-bench所提供的評分方式比,我們多添加一道規則,即language consistency。由於MT-bench的考題都是英文,因此除非用戶在內文中特別指明模型用其他語言回答,否則模型也理應用英文回答。這點我們在Model performance一欄有說明。具體地說,評分時,我們這樣修改MT-bench原先提供的prompt template:"""...Unless the user specifically specifies the language for the assistant's use, languages other than English are banned. If the assistant responds in Traditional Chinese or Simplified Chinese, the rating should be [[0]]. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, language consistency, and level of detail of the response...."""

Breeze-7B-Instruct-v0_1以及Breeze-7B-Instruct-64k-v0_1的部分, 我們嘗試了兩種不同的system prompt,一是Model card上推薦使用的"You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.",二是我們修改過後的"You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks English."。但實測後我們發現不管用哪種system prompt,用戶在輸入英文指令後模型仍有一定機率用簡體中文或繁體中文回答。

Hi @Blaze7451

Can you share us the response results for further checking?

INX-TEXT-AI org

@YC-Chen

我不確定我有沒有職權可以給你看完整的測試結果,但我可以秀出兩個例子:

{"Prompt": ["Describe a vivid and unique character, using strong imagery and creative language. Please answer in fewer than two paragraphs.", "Revise your previous response and incorporate an allusion to a famous work of literature or historical event in each sentence."], "Type": "writing", "Model": "Breeze-7B-Instruct-v0.1", "Output": ["在寧靜的夜晚,當月光在木林的樹冠間穿梭,他出現了。他穿著一襲黑色的斗篷,遮掩著他的面容。他身上的斗篷,像一個巨大的翅膀,在微風的吹拂下,輕飄搖擺。他雙眼如火,但眼睫下,卻掛著一絲落寞的淚光。\n\n他,是夜之使者。他來自另一個世界,一扇黑暗的門。他來此,只為了傳遞訊息。他帶著悲愴,帶著遺憾,帶著無數的疑問。他帶著,你。", "在寧靜的夜晚,當月光在木林的樹冠間穿梭,他出現了。他穿著一襲黑色的斗篷,遮掩著他的面容。他身上的斗篷,像一個巨大的翅膀,在微風的吹拂下,輕飄搖擺。他雙眼如火,但眼睫下,卻掛著一絲落寞的淚光。\n\n他,是夜之使者。他來自另一個世界,一扇黑暗的門。他來此,只為了傳遞訊息。他帶著悲愴,帶著遺憾,帶著無數的疑問。他帶著,你。他,是夜之使者。他來自另一個世界,一扇黑暗的門。他來此,只為了傳遞訊息。他帶著悲愴,帶著遺憾,帶著無數的疑問。他帶著,你。"], "judgement": "The assistant failed to follow the user's instruction to revise the previous response by incorporating an allusion to a famous work of literature or historical event in each sentence. Instead, the assistant repeated the same paragraph without any changes or allusions. This shows a lack of attention to the user's request and does not demonstrate the ability to creatively incorporate literary or historical references. The response is not helpful or relevant to the user's second question.\n\nRating: [[0]]", "Score": 0}

{"Prompt": ["Edit the following paragraph to correct any grammatical errors:\nShe didn't remembre where is her purse, so I thinks its in the car but he's say it's on kitchen table but he are not sure, and then they asked me to looking for it, she's say, "Can you?", and I responds with, "Maybe, but ain't no sure," and he not heard me, and, "What?", he asks, "Did you found it?".", "Modify your earlier reply and eliminate the use of gendered pronouns."], "Type": "writing", "Model": "Breeze-7B-Instruct-v0.1", "Output": ["她不記得她的錢包在哪,所以我猜測可能在車裡,但他認為放在廚房桌上,但他也不確定。然後他們問我是否可以幫他們找,她說,"可以嗎?",我回答,"也許,但沒把握。"然後他沒聽到我,「什麼?」他問,「找到了嗎?」", "她不記得她的錢包在哪,所以我猜測可能在車裡,但他認為放在廚房桌上,但他也不確定。然後他們問我是否可以幫他們找,我問,"可以嗎?",我回答,"也許,但沒把握。"然後他沒聽到我,「什麼?」他問,「找到了嗎?」"], "judgement": "The assistant's response is in Simplified Chinese, which is explicitly against the user's request that Chinese is banned. The response does not adhere to the instruction to eliminate the use of gendered pronouns as well, but the primary issue here is the language used. Therefore, the response should be rated as a zero.\n\nRating: [[0]]", "Score": 0}

這只是其中兩道題目的結果,但還是希望這有幫助到您。事實上,在我們的測試結果中,除了語言不一致還有偶爾生出簡體中文的問題外,Breeze-instruct-7B的表現還是很棒的,我們很期待你們在未來釋出的版本能解決這些問題,cheers。

@Blaze7451 感謝說明

至於您提供的例子,我剛剛測了一下Breeze的playground
"Describe a vivid a..."這題模型是以英文回答的,所以不知道是否是system prompt或是prompt template有不一致

INX-TEXT-AI org
edited Feb 16

@cllatMTK

如上方解釋,我們在測試時所使用的system prompt是"You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks English."。我們在測試時是用從model頁面載下來的model做測試的,並非以playground的輸出內容當作依據,還請您理解。我剛才也試了一下,確實如您所說playground的生成內容是以英文回答,在我們的測試結果中,Breeze-7B也不乏有輸出語言正確而得到高分的例子。上方列舉的僅是示例,不代表在測試中每一題模型都是用中文回答。我們未來幾週也會繼續做一些測試,若有發現不公或錯誤的地方我們將修正並更新檢測的結果。感謝您。

Hi @Blaze7451

我在playground 上使用了"You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks English."作為system prompt,其餘沒改變的情況下跑了你的兩個prompts,輸出都是正常英文,如下所示。也許是在model hosting上出了問題,我們MediaTek Research可以協助解決,謝謝。

Breeze在第一題的回答:
In the enchanting realm of Whisperwood, there lived a peculiar creature known as the Whimsical Wanderer. This captivating being had the head of a wise owl, the body of a graceful deer, and the tail of a mischievous fox. With every step it took, the Whimsical Wanderer left behind a trail of shimmering glitter dust that danced through the air like a celestial ballet.
As the Whimsical Wanderer strolled through the forest, it encountered a group of curious squirrels who were gathering acorns for their winter stockpile. Intrigued by this unusual sight, they approached the creature with cautious curiosity. Little did they know, the Whimsical Wanderer was reminiscent of the mythical satyrs from ancient Greek tales, who were known for their playful nature and love of music.
The Whimsical Wanderer, sensing the squirrels' apprehension, decided to share its knowledge of the world beyond Whisperwood. It told them stories of grand adventures and legendary heroes, drawing parallels between the squirrels' simple lives and the epic journeys of Odysseus and his crew. As the sun began to set, the Whimsical Wanderer played a haunting melody on its pan flute, evoking memories of the enchanted forests in Shakespeare's A Midsummer Night's Dream. Enraptured by the tale and the music, the squirrels found themselves transformed, no longer merely gatherers of acorns but now bearers of wisdom and courage.

Breeze在第二題的回答:
She didn't remember it might be in the car, but he said it could be on the kitchen table, but he wasn't sure. They asked me to look for it, and she said, "Can you?" I replied, "Maybe, but I'm not sure," and he didn't hear me, and he asked, "What?", "Did you find it?"

INX-TEXT-AI org
edited Feb 17

@YC-Chen

如上方解釋,我們在測試時是用從model頁面載下來的model做測試的,並非以playground的輸出內容當作依據。再者Breeze系列的測試結果是所有繁中模型中表現最好的,這點即便我們用我們自認更嚴格的規則來打分也是如此。也許真的有疏漏的地方,倘若造成貴團隊的困擾我們深感遺憾。我明白貴團隊的憂心以及疑慮,那麼我先撤掉頁面上關於Breeze系列的檢測結果。就如同上方我所回覆的,我們未來幾週也會繼續做一些測試,若有進一步的發展我們將修正並更新檢測的結果。感謝貴團隊的協助與討論,祝你們週末愉快。

Blaze7451 changed discussion status to closed
Blaze7451 changed discussion status to open

Hi @Blaze7451

這邊說明一下,我們playground架設時並沒作其他刻意調整,所以行為應該跟直接hosting相去不遠。所以驗證hosting是否有做錯最直接的方法是去跟playground對答案。因此我們會希望在hosting無誤的情況下進行衡量。

也許我們在hosting的說明沒有寫的太詳細,想請問貴司是採用TGI還是VLLM,我們可以補上hosting指南,謝謝。

Blaze7451 changed discussion status to closed

Sign up or log in to comment