TY - JOUR
T1 - Learning to Generate Descriptions of Visual Data Anchored in Spatial Relations
AU - Muscat, Adrian
AU - Belz, Anja
N1 - © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
PY - 2017/7/18
Y1 - 2017/7/18
N2 - The explosive growth of visual data both online and offline, in private and public repositories, has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all these tasks as well as play an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task: the first step identifies objects in an image, the second step detects spatial relations between object pairs on the basis of language and visual features, and the third step maps the spatial relations to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
AB - The explosive growth of visual data both online and offline, in private and public repositories, has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all these tasks as well as play an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task: the first step identifies objects in an image, the second step detects spatial relations between object pairs on the basis of language and visual features, and the third step maps the spatial relations to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
U2 - 10.1109/MCI.2017.2708559
DO - 10.1109/MCI.2017.2708559
M3 - Article
SN - 1556-603X
VL - 12
SP - 29
EP - 42
JO - IEEE Computational Intelligence Magazine
JF - IEEE Computational Intelligence Magazine
IS - 3
ER -