Human evaluations of machine translation (MT) weigh many aspects of translation, including adequacy, fidelity, and fluency of the translation (Hovy, 1999; White and O'Connell, 1994). A comprehensive catalog of MT evaluation techniques and their rich literature is given by Reeder (2001). For the most part, these various human evaluation approaches are quite expensive (Hovy, 1999). Moreover, they can take weeks or months to finish. This is a big problem because developers of machine translation systems need to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas. We believe that MT progress stems from evaluation and that there is a logjam of fruitful research ideas waiting to be released from the evaluation bottleneck. (We therefore call our method the bilingual evaluation understudy, BLEU.) Developers would benefit from an inexpensive automatic evaluation that is quick, language-independent, and correlates highly with human evaluation. We propose such an evaluation method in this paper.

How does one measure translation performance? The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind our proposal. To judge the quality of a machine translation, one measures its closeness to one or more reference human translations according to a numerical metric. Thus, our MT evaluation system requires two ingredients: a numerical translation closeness metric and a corpus of good quality human reference translations. We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable length phrase matches against the reference translations. This view gives rise to a family of metrics using various weighting schemes. We have selected a promising baseline metric from this family.

In Section 2, we describe the baseline metric in detail. In Section 3, we evaluate the performance of BLEU. In Section 4, we describe a human evaluation experiment. In Section 5, we compare the baseline metric's performance with the human evaluations.

Typically, there are many "perfect" translations of a given source sentence. These translations may vary in word choice or in word order even when they use the same words. And yet humans can clearly distinguish a good translation from a bad one. For example, consider these two candidate translations of a Chinese source sentence:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Although they appear to be on the same subject, they differ markedly in quality. For comparison, we provide three reference human translations of the same sentence:

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is clear that the good translation, Candidate 1, shares many words and phrases with these three reference translations, while Candidate 2 does not. We will shortly quantify this notion of sharing in Section 2.1. But first observe that Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization). In contrast, Candidate 2 exhibits far fewer matches, and their extent is less. It is clear that a program can rank Candidate 1 higher than Candidate 2 simply by comparing n-gram matches between each candidate translation and the reference translations. Experiments over large collections of translations presented in Section 5 show that this ranking ability is a general phenomenon and not an artifact of a few toy examples.

The primary programming task for a BLEU implementor is to compare the n-grams of the candidate with the n-grams of the reference translations and count the number of matches. These matches are position-independent. The more matches there are, the better the candidate translation is. For simplicity, we first focus on computing unigram matches.
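To make the counting concrete, here is a minimal Python sketch of position-independent n-gram matching. It is our own illustration, not the authors' code, and the function names are ours.

    from collections import Counter

    def ngrams(tokens, n):
        """Multiset of n-grams (as tuples) in a list of tokens."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_matches(candidate, references, n):
        """Position-independent n-gram matches: each occurrence of a candidate
        n-gram counts as a match if that n-gram occurs in any reference."""
        cand_counts = ngrams(candidate, n)
        ref_grams = set()
        for ref in references:
            ref_grams.update(ngrams(ref, n))
        return sum(c for g, c in cand_counts.items() if g in ref_grams)

    # Candidate 1 of Example 1 against the three references (lowercased, punctuation dropped):
    cand1 = ("it is a guide to action which ensures that the military always "
             "obeys the commands of the party").split()
    refs = [("it is a guide to action that ensures that the military will "
             "forever heed party commands").split(),
            ("it is the guiding principle which guarantees the military forces "
             "always being under the command of the party").split(),
            ("it is the practical guide for the army always to heed the "
             "directions of the party").split()]
    print(ngram_matches(cand1, refs, 1))  # 17: every unigram except "obeys" occurs in some reference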
The cornerstone of our metric is the familiar precision measure. To compute precision, one simply counts up the number of candidate translation words (unigrams) that occur in any reference translation and then divides by the total number of words in the candidate translation. Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable but high-precision translations like that of Example 2: a candidate consisting of the word "the" repeated seven times, scored against the references "The cat is on the mat." and "There is a cat on the mat." Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision. To compute it, one first counts the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count, adds the clipped counts up, and divides by the total (unclipped) number of candidate words. In Example 1, Candidate 1 achieves a modified unigram precision of 17/18, whereas Candidate 2 achieves a modified unigram precision of 8/14. Similarly, the modified unigram precision in Example 2 is 2/7, even though the standard unigram precision is 7/7.

Modified n-gram precision is computed similarly for any n: all candidate n-gram counts and their corresponding maximum reference counts are collected. The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams. In Example 1, Candidate 1 achieves a modified bigram precision of 10/17, whereas the lower quality Candidate 2 achieves a modified bigram precision of 1/13. In Example 2, the (implausible) candidate achieves a modified bigram precision of 0. This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency. A translation using the same words (1-grams) as the references tends to satisfy adequacy. The longer n-gram matches account for fluency.

2.1.1 Modified n-gram precision on blocks of text

How do we compute modified n-gram precision on a multi-sentence test set? Although one typically evaluates MT systems on a corpus of entire documents, our basic unit of evaluation is the sentence. A source sentence may translate to many target sentences, in which case we abuse terminology and refer to the corresponding target sentences as a "sentence." We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, pn, for the entire test corpus. (BLEU only needs to match human judgment when averaged over a test corpus; scores on individual sentences will often vary from human judgments. For example, a system that produces the fluent phrase "East Asian economy" is penalized heavily on the longer n-gram precisions if the references happen to read "economy of East Asia." The key to BLEU's success is that all systems are treated similarly and that multiple human translators with different styles are used, so this effect cancels out in comparisons between systems.)
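A minimal Python sketch of this clipped computation, including the corpus-level pooling into pn, follows. It is our own illustration under our own naming, not the paper's reference implementation.

    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def modified_precision(corpus, n):
        """Corpus-level modified n-gram precision p_n.
        `corpus` is a list of (candidate_tokens, list_of_reference_token_lists) pairs."""
        clipped_total = 0
        candidate_total = 0
        for candidate, references in corpus:
            cand_counts = ngrams(candidate, n)
            max_ref_counts = Counter()
            for gram_list in references:
                for gram, count in ngrams(gram_list, n).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], count)
            # Clip each candidate n-gram count by its maximum single-reference count.
            clipped_total += sum(min(count, max_ref_counts[gram])
                                 for gram, count in cand_counts.items())
            candidate_total += sum(cand_counts.values())
        return clipped_total / candidate_total

    # Example 2: "the" occurs at most twice in a single reference, so the clipped
    # count is 2 and the modified unigram precision is 2/7.
    cand = ["the"] * 7
    refs = [["the", "cat", "is", "on", "the", "mat"],
            ["there", "is", "a", "cat", "on", "the", "mat"]]
    print(modified_precision([(cand, refs)], 1))  # 0.2857... = 2/7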
2.1.2 Ranking systems using only modified n-gram precision

To verify that modified n-gram precision distinguishes between very good translations and bad translations, we computed the modified precision numbers on the output of a (good) human translator and a standard (poor) machine translation system, using 4 reference translations for each of 127 source sentences. The average precision results are shown in Figure 1. The strong signal differentiating human (high precision) from machine (low precision) is striking. The difference becomes stronger as we go from unigram precision to 4-gram precision. It appears that any single n-gram precision score can distinguish between a good translation and a bad translation. To be useful, however, the metric must also reliably distinguish between translations that do not differ so greatly in quality. Furthermore, it must distinguish between two human translations of differing quality. This latter requirement ensures the continued validity of the metric as MT approaches human translation quality. To this end, we obtained a human translation by someone lacking native proficiency in both the source language (Chinese) and the target language (English).

For comparison, we also acquired a human translation of the same documents by a native English speaker, and we obtained machine translations from three commercial systems. These five "systems" (two human and three machine) were scored against two reference professional human translations. The average modified n-gram precision results are shown in Figure 2. Each of these n-gram statistics implies the same ranking: H2 (Human-2) is better than H1 (Human-1), and there is a big drop in quality between H1 and S3 (Machine/System-3). S3 appears better than S2, which in turn appears better than S1. Remarkably, this is the same rank order assigned to these "systems" by human judges, as we discuss later. While there seems to be ample signal in any single n-gram precision, it is more robust to combine all these signals into a single number metric.

2.1.3 Combining the modified n-gram precisions

How should we combine the modified precisions for the various n-gram sizes? A weighted linear average of the modified precisions resulted in encouraging results for the 5 systems. However, as can be seen in Figure 2, the modified n-gram precision decays roughly exponentially with n: the modified unigram precision is much larger than the modified bigram precision, which in turn is much bigger than the modified trigram precision. A reasonable averaging scheme must take this exponential decay into account; a weighted average of the logarithm of the modified precisions satisfies this requirement. BLEU uses the average logarithm with uniform weights, which is equivalent to using the geometric mean of the modified n-gram precisions. Experimentally, we obtain the best correlation with monolingual human judgments using a maximum n-gram order of 4, although 3-grams and 5-grams give comparable results.

A candidate translation should be neither too long nor too short, and an evaluation metric should enforce this. To some extent, the n-gram precision already accomplishes this. N-gram precision penalizes spurious words in the candidate that do not appear in any of the reference translations. Additionally, modified precision is penalized if a word occurs more frequently in the candidate translation than its maximum reference count. This rewards using a word as many times as warranted and penalizes using a word more times than it occurs in any of the references. However, modified n-gram precision alone fails to enforce the proper translation length, as illustrated by a short, absurd example: the two-word candidate "of the" scored against the references of Example 1. Because this candidate is so short compared to the proper length, one expects to find inflated precisions: its modified unigram precision is 2/2 and its modified bigram precision is 1/1.

Traditionally, precision has been paired with recall to overcome such length-related problems. However, BLEU considers multiple reference translations, each of which may use a different word choice to translate the same source word. Furthermore, a good candidate translation will only use (recall) one of these possible choices, not all of them. Indeed, recalling all choices leads to a bad translation. Here is an example:

Candidate 1: I always invariably perpetually do.
Candidate 2: I always do.
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.

The first candidate recalls more words from the references, but it is obviously a poorer translation than the second candidate. Thus, naive recall computed over the set of all reference words is not a good measure. Admittedly, one could align the reference translations to discover synonymous words and compute recall over concepts rather than words. But, given that reference translations vary in length and differ in word order and syntax, such a computation is complicated.

Candidate translations longer than their references are already penalized by the modified n-gram precision measure: there is no need to penalize them again. Consequently, we introduce a multiplicative brevity penalty factor. With this brevity penalty in place, a high-scoring candidate translation must now match the reference translations in length, in word choice, and in word order. Note that neither the brevity penalty nor the modified n-gram precision length effect directly considers the source length; instead, they consider the range of reference translation lengths in the target language. We wish to make the brevity penalty 1.0 when the candidate's length is the same as any reference translation's length. For example, if there are three references with lengths of 12, 15, and 17 words and the candidate translation is a terse 12 words, we want the brevity penalty to be 1. We call the closest reference sentence length the "best match length."
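A small sketch of selecting this best match length (our own illustration; how ties between equally close reference lengths are broken is our assumption, since the text does not specify it):

    def best_match_length(candidate_len, reference_lens):
        """Reference length closest to the candidate length.
        Ties are broken toward the shorter reference (an assumption of this sketch)."""
        return min(sorted(reference_lens), key=lambda r: abs(r - candidate_len))

    # The example above: references of 12, 15, and 17 words and a terse 12-word candidate.
    print(best_match_length(12, [12, 15, 17]))  # 12, so the brevity penalty will be 1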
One consideration remains: if we computed the brevity penalty sentence by sentence and averaged the penalties, then length deviations on short sentences would be punished harshly. Instead, we compute the brevity penalty over the entire corpus to allow some freedom at the sentence level. We first compute the test corpus' effective reference length, r, by summing the best match lengths for each candidate sentence in the corpus. We choose the brevity penalty to be a decaying exponential in r/c, where c is the total length of the candidate translation corpus.

We take the geometric mean of the test corpus' modified precision scores and then multiply the result by an exponential brevity penalty factor. Currently, case folding is the only text normalization performed before computing the precision. We first compute the geometric average of the modified n-gram precisions, pn, using n-grams up to length N and positive weights wn summing to one. Next, let c be the length of the candidate translation and r be the effective reference corpus length. We compute the brevity penalty BP as BP = 1 if c > r, and BP = exp(1 - r/c) if c <= r. Then BLEU = BP * exp(sum_{n=1}^{N} wn log pn). The ranking behavior is more immediately apparent in the log domain: log BLEU = min(1 - r/c, 0) + sum_{n=1}^{N} wn log pn. In our baseline, we use N = 4 and uniform weights wn = 1/N; we give a small end-to-end sketch of this computation below.

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. It is important to note that the more reference translations there are per sentence, the higher the score is. Thus, one must be cautious when making even "rough" comparisons between evaluations with different numbers of reference translations: on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and 0.2571 against two references. Table 1 shows the BLEU scores of the 5 systems against two references on this test corpus. The MT systems S2 and S3 are very close on this metric. Hence, several questions arise: Is the difference in the BLEU metric reliable? What is the variance of the BLEU score? If we were to pick another random set of 500 sentences, would we still judge S3 to be better than S2?

To answer these questions, we divided the test corpus into 20 blocks of 25 sentences each and computed the BLEU metric on these blocks individually. We thus have 20 samples of the BLEU metric for each system. We computed the means, variances, and paired t-statistics, which are displayed in Table 2. The t-statistic compares each system with its left neighbor in the table; for example, t = 6 for the pair S1 and S2. Note that the numbers in Table 1 are the BLEU metric on an aggregate of 500 sentences, while the means in Table 2 are averages of the BLEU metric on aggregates of 25 sentences. As expected, these two sets of results are close for each system and differ only by small finite block size effects. Since a paired t-statistic of 1.7 or above is 95% significant, the differences between the systems' scores are statistically very significant. The reported variance on 25-sentence blocks serves as an upper bound on the variance of sizeable test sets like the 500-sentence corpus.

How many reference translations do we need? We simulated a single-reference test corpus by randomly selecting one of the 4 reference translations as the single reference for each of the 40 stories. In this way, we ensured a degree of stylistic variation. The systems maintained the same rank order as with multiple references. This outcome suggests that we may use a big test corpus with a single reference translation, provided that the translations are not all from the same translator.
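Before turning to the human evaluation, the pieces defined above can be assembled into one corpus-level computation. The sketch below is our own illustration under the baseline settings (N = 4, uniform weights wn = 1/N, ties in the best match length broken toward the shorter reference); it is not the authors' implementation, and it assumes every pn is nonzero.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(corpus, max_n=4):
        """Corpus-level BLEU with uniform weights w_n = 1/N.
        `corpus` is a list of (candidate_tokens, list_of_reference_token_lists) pairs.
        Assumes every modified precision p_n is nonzero (math.log fails otherwise)."""
        weights = [1.0 / max_n] * max_n
        log_p = []
        for n in range(1, max_n + 1):
            clipped = total = 0
            for candidate, references in corpus:
                cand_counts = ngrams(candidate, n)
                max_ref = Counter()
                for ref in references:
                    for gram, count in ngrams(ref, n).items():
                        max_ref[gram] = max(max_ref[gram], count)
                clipped += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
                total += sum(cand_counts.values())
            log_p.append(math.log(clipped / total))
        # c: total candidate length; r: effective reference length, the sum of the
        # best match lengths (ties broken toward the shorter reference).
        c = sum(len(cand) for cand, _ in corpus)
        r = sum(min(sorted(len(ref) for ref in refs), key=lambda L: abs(L - len(cand)))
                for cand, refs in corpus)
        bp = 1.0 if c > r else math.exp(1.0 - r / c)
        # Equivalently, log BLEU = min(1 - r/c, 0) + sum_n w_n * log p_n.
        return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_p)))

On real data one would also apply case folding, the only text normalization mentioned above, before tokenizing.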
We had two groups of human judges. The first group, called the monolingual group, consisted of 10 native speakers of English. The second group, called the bilingual group, consisted of 10 native speakers of Chinese who had lived in the United States for the past several years. None of the human judges was a professional translator. The human judges judged our 5 standard systems on a Chinese sentence subset extracted at random from our 500-sentence test corpus. We paired each source sentence with each of its 5 translations, for a total of 250 pairs of Chinese source sentences and English translations. We prepared a web page with these translation pairs randomly ordered, to disperse the five translations of each source sentence. All judges used this same web page and saw the sentence pairs in the same order. They rated each translation from 1 (bad) to 5 (good). The monolingual group made its judgments based only on the translations' readability and fluency. As must be expected, some judges were more liberal than others, and some sentences were easier to translate than others. To account for the intrinsic differences between judges and between sentences, we compared each judge's rating for a sentence across systems.

We performed four pairwise t-test comparisons between adjacent systems as ordered by their aggregate average score. Figure 3 shows the mean difference between the scores of two consecutive systems and the 95% confidence interval about the mean. We see that S2 is quite a bit better than S1 (by a mean opinion score difference of 0.326 on the 5-point scale), while S3 is judged a little better than S2 (by 0.114). Both differences are significant at the 95% level. The human translation H1 is much better than the best system, though a bit worse than the human translation H2. This is not surprising, given that H1 is not a native speaker of either Chinese or English, whereas H2 is a native English speaker. Again, the difference between the human translators is significant beyond the 95% level.

5 BLEU vs The Human Evaluation

Figure 5 shows a linear regression of the monolingual group scores as a function of the BLEU score over two reference translations for the 5 systems. The high correlation coefficient of 0.99 indicates that BLEU tracks human judgment well. Particularly interesting is how well BLEU distinguishes between S2 and S3, which are quite close. Figure 6 shows the comparable regression results for the bilingual group; the correlation coefficient is 0.96.

We next take the worst system as a reference point and compare the BLEU scores with the human judgment scores of the remaining systems relative to the worst system. We took the BLEU, monolingual group, and bilingual group scores for the 5 systems and linearly normalized them by their corresponding ranges (the maximum and minimum scores across the 5 systems). The normalized scores are shown in Figure 7. This figure illustrates the high correlation between the BLEU score and the monolingual group. Of particular interest is the accuracy of BLEU's estimate of the small difference between S2 and S3 and the larger difference between S3 and H1. The figure also highlights the relatively large gap between the MT systems and the human translators. (Crossing this chasm for Chinese-English translation appears to be a significant challenge for the current state-of-the-art systems.) In addition, we surmise that the bilingual group was more forgiving in judging H1 relative to H2, because the monolingual group found a rather large difference in the fluency of their translations.

We believe that BLEU will accelerate the MT R&D cycle by allowing researchers to rapidly home in on effective modeling ideas. Our belief is reinforced by a recent statistical analysis of BLEU's correlation with human judgment for translation into English from four quite different languages (Arabic, Chinese, French, Spanish) representing 3 different language families (Papineni et al., 2002). BLEU's strength is that it correlates highly with human judgments by averaging out individual sentence judgment errors over a test corpus rather than attempting to divine the exact human judgment for every sentence: quantity leads to quality. Finally, since MT and summarization can both be viewed as natural language generation from a textual context, we believe BLEU could be adapted to evaluating summarization or similar NLG tasks.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency and monitored by SPAWAR under contract No. N66001-99-2-8916. The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred. We gratefully acknowledge comments about the geometric mean by John Makhoul of BBN and discussions with George Doddington of NIST. We especially wish to thank our colleagues who served in the monolingual and bilingual judge pools for their perseverance in judging the output of Chinese-English MT systems.