Learning to Generate Descriptions of Visual Data Anchored in Spatial Relations

Adrian Muscat, Anja Belz

    Research output: Contribution to journal › Article › peer-review

    Abstract

    The explosive growth of visual data, both online and offline, in private and public repositories has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all of these tasks, as well as playing an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task: the first step identifies objects in an image; the second step detects spatial relations between object pairs on the basis of language and visual features; and the third step maps the spatial relations to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
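
    To make the pipeline concrete, the sketch below illustrates Steps 2 and 3, assuming scikit-learn is available. The geometric features, the relation inventory, the toy training pairs and the `describe` template are all illustrative assumptions, not the exact features or NLG strategies used in the paper.

```python
# Minimal sketch of Steps 2-3: predict a spatial relation for an object
# pair with a random forest, then realise it as a natural-language string.
# Feature set, relations and training data below are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inventory of spatial relations (prepositions) for Step 2.
RELATIONS = ["above", "below", "next_to", "in_front_of", "behind", "on"]

def geometric_features(box_a, box_b):
    """Simple geometric features for an object pair; box = (x, y, w, h).
    Stand-ins for the paper's visual features, not the actual feature set."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    dx = (xb + wb / 2) - (xa + wa / 2)   # horizontal centre offset
    dy = (yb + hb / 2) - (ya + ha / 2)   # vertical centre offset
    area_ratio = (wa * ha) / max(wb * hb, 1e-6)
    # Bounding-box overlap, normalised by the smaller box area.
    ix = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    overlap = (ix * iy) / max(min(wa * ha, wb * hb), 1e-6)
    return [dx, dy, area_ratio, overlap]

# Step 2: learn the mapping from features to spatial relations.
# In practice X_train / y_train come from annotated object pairs.
X_train = np.array([geometric_features((10, 60, 30, 30), (12, 10, 28, 25)),
                    geometric_features((50, 20, 40, 40), (52, 70, 38, 30))])
y_train = np.array(["below", "above"])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Step 3: map the predicted relation to an NL description with a simple
# template, one of several possible NLG strategies.
def describe(subj, obj, relation):
    surface = {"next_to": "next to", "in_front_of": "in front of"}
    return f"The {subj} is {surface.get(relation, relation)} the {obj}."

features = geometric_features((10, 60, 30, 30), (12, 10, 28, 25))
relation = clf.predict([features])[0]
print(describe("dog", "sofa", relation))  # e.g. "The dog is below the sofa."
```

    A template realiser like `describe` corresponds to only the simplest of the six NLG strategies the paper compares; richer strategies would vary surface order and wording per relation.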
    Original language: English
    Pages (from-to): 29-42
    Number of pages: 14
    Journal: IEEE Computational Intelligence Magazine
    Volume: 12
    Issue number: 3
    Publication status: Published - 18 Jul 2017

    Bibliographical note

    © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
