Prodigy: Probabilistic Deep Generation

  • Belz, Anya (PI)
  • Kow, Eric (CoI)

    Project Details

    Description

Computational methods for generating language are lagging behind computational methods for analysing language in several ways, most obviously in that they have seen little commercial use.

The main reasons are that systems for generating language take inordinate amounts of time to build, cannot easily be reused once built, and tend to lack language variation, a shortcoming that readers readily perceive as poor quality.

The current situation in language generation research is reminiscent of language analysis research in the late 1980s, when symbolic and statistical methods briefly formed entirely separate research paradigms. Language analysis soon moved towards a merger of the two, as researchers realised that symbolic methods lacked the efficiency and robustness that probabilistic methods could provide, while probabilistic methods in turn stood to benefit from the accuracy and subtlety of symbolic methods.

A similar development is currently underway in the field of machine translation, where, after several years in which purely statistical methods dominated, researchers are now beginning to bring linguistic knowledge back in.

    The experience from these research fields suggests that higher quality can be achieved when the symbolic and statistical paradigms join forces. Recent research shows that this is likely to be true for language generation too. The purpose of the Prodigy project is to develop, for the first time, a comprehensive, linguistically informed, probabilistic methodology for generating language that substantially improves development time, reusability and language variation in language generation systems, and thereby enhances their commercial viability.

Taking Anya Belz's previous EPSRC-funded research on probabilistic NLG as a starting point, the Prodigy project explored whether combining probabilistic and linguistic methods could be as beneficial for language generation as it has been for language analysis.

The team focused on two aspects in particular: (i) developing reusable data representation and encoding strategies, and (ii) developing specific probabilistic techniques for guiding language generation processes. The representations and techniques were tested and evaluated on five different data sets collected from real-world text production tasks, including weather forecasts, descriptions of museum exhibits, and nurses' reports.
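As a rough illustration of what a probabilistic choice during generation can look like, the sketch below selects among alternative surface realisations of a weather event according to corpus-style probabilities. The event structure, the phrasings and the probability values are all hypothetical and are not taken from the Prodigy systems or data sets.

```python
import random

# Hypothetical alternatives for realising a single "wind change" event, each
# paired with a probability as if estimated from a corpus of human-written
# forecasts. Neither the phrasings nor the numbers come from the Prodigy
# data sets; they only illustrate the mechanism.
REALISATIONS = [
    ("{dir} {lo}-{hi} increasing {lo2}-{hi2} later", 0.6),
    ("{dir} {lo}-{hi}, rising to {lo2}-{hi2} later in the period", 0.3),
    ("winds {dir} {lo}-{hi} becoming {lo2}-{hi2}", 0.1),
]

def realise(event, rng=random):
    """Pick one surface template according to its probability and fill it
    with the event's attribute values."""
    templates, weights = zip(*REALISATIONS)
    chosen = rng.choices(templates, weights=weights, k=1)[0]
    return chosen.format(**event)

if __name__ == "__main__":
    event = {"dir": "SSW", "lo": 16, "hi": 20, "lo2": 28, "hi2": 32}
    print(realise(event))
```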

    Key findings

The project primarily benefited research through advances in the understanding of reusable language generation technology, and industry through improvements in commercial viability. The technology itself can also help individual users by speeding up text production and by making information available in a modality that would not otherwise exist (e.g. enabling visually impaired readers to access graphical information).

The experience from other language processing research fields suggested that higher quality can be achieved when symbolic and statistical paradigms join forces, and recent research had indicated that the same is likely to be true for language generation. The Prodigy project developed, for the first time, a comprehensive, linguistically informed, probabilistic methodology for generating language that substantially improved development time, reusability and variation in language generation systems, and thereby enhanced their commercial viability.

Prodigy produced two major research outputs beyond its publications:

1. The Prodigy-METEO corpus of paired numerical and textual data, which is freely available and has been used by several research teams as a benchmark (a sketch of such a pairing follows this list).

    2. Freely available language generation technology.
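To give a rough sense of what a paired numerical-textual instance in the Prodigy-METEO style looks like, the sketch below pairs a short wind-data time series with a human-style forecast string. The field layout, values and text are invented for illustration only; the released corpus defines its own encoding.

```python
# One illustrative input/output pair in the spirit of the Prodigy-METEO data:
# the input is a short time series of numerical wind predictions and the
# output is the corresponding human-written wind forecast text. Field names,
# values and the text are invented; consult the released corpus for its
# actual encoding.
example_pair = {
    "input": [
        # (hour, direction, min_speed_kt, max_speed_kt) -- hypothetical layout
        (6,  "SSW", 16, 20),
        (12, "S",   18, 22),
        (18, "SSE", 24, 28),
    ],
    "output": "SSW 16-20 BACKING SSE 24-28 BY EVENING",
}

def speed_range(pair):
    """Return the overall speed range covered by the numerical input."""
    lows = [lo for (_, _, lo, _) in pair["input"]]
    highs = [hi for (_, _, _, hi) in pair["input"]]
    return min(lows), max(highs)

print(speed_range(example_pair))  # (16, 28)
```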
Status: Finished
Effective start/end date: 18/06/07 – 17/06/10

    Funding

    • EPSRC
