It is commonly held that machine-readable dictionaries play a key role in bootstrapping effective wide-coverage language technology, especially in less well-resourced languages. However, while the linguistic knowledge they contain is clearly necessary for this goal, it is far from clear that the format in which it is presented is sufficient to reach it. A crucial step in the deployment of such resources is to map them into lexical databases with standardised and well-understood structure and semantics. Furthermore, considerable additional benefits are obtained if such structure and semantics are shared with other linguistic resources. Achieving such a goal, however, is often not an easy task. This paper describes how such a mapping was carried out in the CONCEDE project for six Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) for which few wide-coverage lexical resources had previously been available. In a two-stage process, the machine-readable data for each language was first mapped into broadly compatible, TEI-compliant SGML representations, and then these representations were harmonised into a single XML scheme. The resulting framework offers a concise, flexible lexical database specification, with a demonstrable ability to cope with a diverse range of dictionary and language requirements, and lexical resources suitable for monolingual and multilingual applications.
|Title of host publication||COMPLEX 2003, 7th Conference on Computational Lexicography and Text Research|
|Number of pages||9|
|Publication status||Published - 2003|
|Event||COMPLEX 2003, 7th Conference on Computational Lexicography and Text Research - Budapest, Hungary|
Duration: 1 Jan 2003 → …