BERTScore: Evaluating Text Generation with BERT
Varsha Kishore, Tianyi Zhang, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
Translate: "ich liebe es"
Candidates: I am like / I like / I like it / I love it / I am loving it
Reference: I love it
Metric: 0.88/1.00
Text Generation Evaluation Metrics
N-gram matching approaches: BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), ROUGE (Lin, 2004), chrF (Popović, 2015)
Embedding-based metrics: MEANT 2.0 (Lo, 2017), YiSi-1 (Lo et al., 2018), BERTScore
BLEU: N-gram Matching
Reference: The weather is cold today
Candidate 1: The weather is sunny today
Candidate 2: It is freezing today
BLEU cannot identify synonyms, so it gives the higher score to candidate 1, even though candidate 2 is the closer paraphrase of the reference.
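To make the slide's point concrete, here is a minimal sketch of clipped n-gram precision on the three sentences. It is a simplification and not full BLEU (no brevity penalty, no geometric mean over n-gram orders); the helper names `ngrams` and `ngram_precision` are illustrative, not from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: share of candidate n-grams found in the reference.

    Simplified illustration only; real BLEU combines several n-gram orders
    and applies a brevity penalty.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "The weather is cold today"
candidate_1 = "The weather is sunny today"
candidate_2 = "It is freezing today"

for name, cand in [("candidate 1", candidate_1), ("candidate 2", candidate_2)]:
    p1 = ngram_precision(cand, reference, 1)
    p2 = ngram_precision(cand, reference, 2)
    print(f"{name}: unigram precision={p1:.2f}, bigram precision={p2:.2f}")
```

Candidate 1 shares most of its surface n-grams with the reference despite saying the opposite about the weather, while candidate 2 shares almost none, which is why exact matching ranks them the wrong way round.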
BERTScore: an evaluation metric that uses BERT embeddings
BERT: a Transformer model pre-trained on masked language modeling and next sentence prediction. It generates token embeddings that reflect each word's context.
BERTScore
Reference: the weather is cold today
Candidate: it is freezing today
Pipeline: contextual embedding of each token, then pairwise cosine similarity between candidate and reference tokens
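The pairwise cosine similarity step can be sketched as follows. The vectors here are hypothetical 3-dimensional stand-ins, not real BERT embeddings (which have hundreds of dimensions), and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(cand_vecs, ref_vecs):
    """Similarity matrix: one row per candidate token, one column per reference token."""
    return [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]


# Hypothetical toy embeddings (assumption: real ones would come from BERT).
cand_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
ref_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

sim = pairwise_cosine(cand_vecs, ref_vecs)
for row in sim:
    print([round(x, 3) for x in row])
```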
Greedy Matching
Candidate: it is freezing today
Reference: the weather is cold today
Each token is matched to the most similar token in the other sentence.
Greedy Matching
Precision: match words in candidate to reference
Recall: match words in reference to candidate
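A minimal sketch of greedy matching over a similarity matrix, assuming rows index candidate tokens and columns index reference tokens. The matrix values below are hypothetical, chosen only to illustrate the row-wise and column-wise maxima, and the function name is illustrative.

```python
def greedy_precision_recall(sim):
    """Greedy matching on a candidate-by-reference similarity matrix.

    Precision: each candidate token (row) takes its best-matching reference token.
    Recall: each reference token (column) takes its best-matching candidate token.
    Both are plain (unweighted) averages of those maxima.
    """
    n_cand = len(sim)
    n_ref = len(sim[0])
    precision = sum(max(row) for row in sim) / n_cand
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    return precision, recall


# Hypothetical similarities: 2 candidate tokens x 3 reference tokens.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

precision, recall = greedy_precision_recall(sim)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

The matching is greedy in the sense that every token independently picks its best partner; two tokens may pick the same partner, and no one-to-one alignment is enforced.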
Greedy Matching - Aggregate
Precision (candidate words matched to reference): 0.713, 0.858, 0.796, 0.913; mean 0.820
Recall (reference words matched to candidate): 0.713, 0.515, 0.858, 0.796, 0.913; mean 0.759
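The aggregation on this slide can be checked with a few lines of arithmetic. This sketch assumes the four-value column belongs to the four-token candidate (precision) and the five-value column to the five-token reference (recall), which is the assignment that reproduces the slide's 0.820 and 0.759.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Best-match similarities read off the slide.
# Assumption: 4 values = candidate tokens, 5 values = reference tokens.
precision_matches = [0.713, 0.858, 0.796, 0.913]
recall_matches = [0.713, 0.515, 0.858, 0.796, 0.913]

precision = mean(precision_matches)
recall = mean(recall_matches)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```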
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
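As a worked instance of the F1 formula, plugging in the two aggregate values from the greedy-matching slide (0.820 and 0.759) gives the following. The result is computed here for illustration and does not appear on the slides; because F1 is symmetric, it does not depend on which of the two values is precision and which is recall.

$$F_1 = 2 \cdot \frac{0.820 \cdot 0.759}{0.820 + 0.759} = \frac{1.24476}{1.579} \approx 0.788$$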
Reference: the weather is cold today
Candidate: it is freezing today
Pipeline: contextual embedding, then pairwise cosine similarity, then F1 score
Evaluation: WMT Translation Benchmark
Compute the correlation between human scores and metric scores:
Reference: The weather is cold today. / Candidate: It is freezing today. / Human: 0.85 / Metric: 0.77
Reference: The garden is nice. / Candidate: The garden was pretty. / Human: 0.77 / Metric: 0.71
Reference: I like apples very much. / Candidate: I love apples. / Human: 0.80 / Metric: 0.79
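The "compute correlation" step can be sketched with the three example pairs from this slide. Two assumptions are made here: that the first number in each pair is the human score and the second is the metric score, and that Pearson correlation is the statistic (the slide does not name one). Three points are far too few for a meaningful estimate; this only shows the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Scores from the slide (assumed column order: human, then metric).
human = [0.85, 0.77, 0.80]
metric = [0.77, 0.71, 0.79]

r = pearson(human, metric)
print(f"Pearson r = {r:.3f}")
```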
Correlation Study
[Figure: bar chart of correlation (y-axis, 0 to 0.8) by language pair (Czech-English, German-English, English-Czech, English-German) for BLEU, ITER, YiSi-1, RUSE, and BERTScore F1]
4 tasks, 8 languages, 363 systems
Download here: https://pypi.org/project/bert-score/
Or just run: pip install bert_score
Also available on GitHub.
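Once the package is installed, scoring candidates against references might look like the sketch below. It relies on the package's `score` function, which takes lists of candidate and reference strings plus a `lang` code and returns precision, recall, and F1 tensors; the wrapper name `bertscore_f1` is hypothetical, and the first call downloads a pre-trained model, so it needs network access.

```python
def bertscore_f1(candidates, references, lang="en"):
    """Return one BERTScore F1 value per candidate/reference pair.

    Hypothetical convenience wrapper. The import is done lazily so this
    module loads even where bert_score is not installed.
    """
    from bert_score import score  # requires: pip install bert_score

    precision, recall, f1 = score(candidates, references, lang=lang)
    return f1.tolist()


# Example call (downloads a model on first use):
# bertscore_f1(["It is freezing today"], ["The weather is cold today"])
```

The default model and any score rescaling depend on the installed version, so absolute values are best compared only within one configuration.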