Comparing automatic and human evaluation of NLG systems

Anja Belz, Ehud Reiter

Research output: Chapter in Book/Conference proceeding with ISSN or ISBN › Conference contribution with ISSN or ISBN › peer-review

Abstract

We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (>0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.
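
The metric-versus-human comparison described in the abstract can be sketched as a per-system correlation analysis. The snippet below is an illustrative example only, not code or data from the paper: the system names, metric values, and human ratings are hypothetical placeholders.

# Illustrative sketch: correlating per-system automatic metric scores
# with mean human judgments, in the spirit of the comparison the abstract
# describes. All values below are hypothetical placeholders.
from scipy.stats import pearsonr

# One automatic metric score and one mean human rating per NLG system.
systems = ["knowledge-based", "statistical-1", "statistical-2", "statistical-3"]
nist_scores = [6.2, 5.8, 5.1, 4.9]     # placeholder NIST scores
human_ratings = [4.1, 3.8, 3.2, 3.0]   # placeholder mean human judgments

# Pearson correlation between the automatic metric and human judgments;
# the paper reports correlations above 0.8 for NIST in its domain.
r, p_value = pearsonr(nist_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")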
Original language: English
Title of host publication: 11th Conference of the European Chapter of the Association for Computational Linguistics
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics
Pages: 313-320
Number of pages: 8
ISBN (Print): 1932432590
DOIs
Publication status: Published - 1 Jan 2006
Event: 11th Conference of the European Chapter of the Association for Computational Linguistics - Trento, Italy
Duration: 1 Jan 2006 → …

Conference

Conference: 11th Conference of the European Chapter of the Association for Computational Linguistics
Period: 1/01/06 → …

Keywords

  • Natural language generation systems
