TY - JOUR
T1 - Learning to Generate Descriptions of Visual Data Anchored in Spatial Relations
AU - Muscat, Adrian
AU - Belz, Anja
N1 - © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
PY - 2017/7/18
Y1 - 2017/7/18
N2 - The explosive growth of visual data both online and offline, in private and public repositories, has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all these tasks as well as play an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task: the first step identifies objects in an image, the second step detects spatial relations between object pairs on the basis of language and visual features, and the third step maps the spatial relations to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
AB - The explosive growth of visual data both online and offline, in private and public repositories, has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all these tasks as well as play an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task: the first step identifies objects in an image, the second step detects spatial relations between object pairs on the basis of language and visual features, and the third step maps the spatial relations to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
U2 - 10.1109/MCI.2017.2708559
DO - 10.1109/MCI.2017.2708559
M3 - Article
SN - 1556-603X
VL - 12
SP - 29
EP - 42
JO - IEEE Computational Intelligence Magazine
JF - IEEE Computational Intelligence Magazine
IS - 3
ER -