Comparing rating scales and preference judgements in language evaluation

Anja Belz; Eric Kow

doi:10.1.1.167.7542

University of Brighton

Research output: Chapter in Book/Conference proceeding with ISSN or ISBN › Conference contribution with ISSN or ISBN › peer-review

Abstract

Rating-scale evaluations are common in NLP, but are problematic for a range of reasons, e.g. they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), where evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one of each pair of experiments is a rating-scale experiment, and the other is a PJE. We find the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences being found.

Original language	English
Title of host publication	Proceedings of the 6th International Language Generation Conference (INLG'10)
Place of Publication	Stroudsburg, PA, USA
Publisher	Association for Computational Linguistics
Pages	7-15
Number of pages	9
DOIs	https://doi.org/10.1.1.167.7542
Publication status	Published - 1 Jan 2010
Event	Proceedings of the 6th International Language Generation Conference (INLG'10) - Dublin, Ireland. Duration: 1 Jan 2010 → …

Conference

Conference	Proceedings of the 6th International Language Generation Conference (INLG'10)
Period	1/01/10 → …

Access to Document

10.1.1.167.7542 (Licence: Unspecified)

http://dl.acm.org/citation.cfm?id=1873743 (Licence: Unspecified)

Cite this

@inproceedings{06c3feb34a664ee8a413df8932e1d6b3,

title = "Comparing rating scales and preference judgements in language evaluation",

abstract = "Rating-scale evaluations are common in NLP, but are problematic for a range of reasons, e.g. they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), where evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one of each pair of experiments is a rating-scale experiment, and the other is a PJE. We find the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences being found.",

author = "Anja Belz and Eric Kow",

year = "2010",

month = jan,

day = "1",

doi = "10.1.1.167.7542",

language = "English",

pages = "7--15",

booktitle = "Proceedings of the 6th International Language Generation Conference (INLG'10)",

publisher = "Association for Computational Linguistics",

note = "Proceedings of the 6th International Language Generation Conference (INLG'10); Conference date: 01-01-2010",

}

Belz, A & Kow, E 2010, Comparing rating scales and preference judgements in language evaluation. in Proceedings of the 6th International Language Generation Conference (INLG'10). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 7-15, Proceedings of the 6th International Language Generation Conference (INLG'10), 1/01/10. https://doi.org/10.1.1.167.7542

Comparing rating scales and preference judgements in language evaluation. / Belz, Anja; Kow, Eric.
Proceedings of the 6th International Language Generation Conference (INLG'10). Stroudsburg, PA, USA: Association for Computational Linguistics, 2010. p. 7-15.

TY - GEN

T1 - Comparing rating scales and preference judgements in language evaluation

AU - Belz, Anja

AU - Kow, Eric

PY - 2010/1/1

Y1 - 2010/1/1

AB - Rating-scale evaluations are common in NLP, but are problematic for a range of reasons, e.g. they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), where evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one of each pair of experiments is a rating-scale experiment, and the other is a PJE. We find the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences being found.

U2 - 10.1.1.167.7542

DO - 10.1.1.167.7542

M3 - Conference contribution with ISSN or ISBN

SP - 7

EP - 15

BT - Proceedings of the 6th International Language Generation Conference (INLG'10)

PB - Association for Computational Linguistics

CY - Stroudsburg, PA, USA

T2 - Proceedings of the 6th International Language Generation Conference (INLG'10)

Y2 - 1 January 2010

ER -
