An investigation into the validity of some metrics for automatically evaluating Natural Language Generation systems

Ehud Reiter; Anja Belz

doi:10.1162/coli.2009.35.4.35405

University of Brighton

Research output: Contribution to journal › Article › peer-review

Abstract

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.

Original language	English
Pages (from-to)	529-558
Number of pages	30
Journal	Computational Linguistics
Volume	35
Issue number	4
DOIs	https://doi.org/10.1162/coli.2009.35.4.35405
Publication status	Published - 31 Dec 2009

Access to Document

10.1162/coli.2009.35.4.35405 (Licence: Unspecified)

http://www.mitpressjournals.org/doi/abs/10.1162/coli.2009.35.4.35405 (Licence: Unspecified)

Cite this

@article{005b7ccfaa6f4e88bfaa29286063f8c9,

title = "An investigation into the validity of some metrics for automatically evaluating Natural Language Generation systems",

abstract = "There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.",

author = "Ehud Reiter and Anja Belz",

year = "2009",

month = dec,

day = "31",

doi = "10.1162/coli.2009.35.4.35405",

language = "English",

volume = "35",

pages = "529--558",

journal = "Computational Linguistics",

issn = "0891-2017",

number = "4",

}

TY - JOUR

T1 - An investigation into the validity of some metrics for automatically evaluating Natural Language Generation systems

AU - Reiter, Ehud

AU - Belz, Anja

PY - 2009/12/31

Y1 - 2009/12/31

N2 - There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.

AB - There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.

U2 - 10.1162/coli.2009.35.4.35405

DO - 10.1162/coli.2009.35.4.35405

M3 - Article

SN - 0891-2017

VL - 35

SP - 529

EP - 558

JO - Computational Linguistics

JF - Computational Linguistics

IS - 4

ER -
