Abstract
The digital age brought with it many opportunities for data analysis, as well as many challenges for data integration and management. Ontologies are a popular data representation structure
because of their inference properties, which are used in searching and analysis. However, ontologies must
assume a defined view of the world, or a domain, which may ignore the information stored within
data or could even impose an unsuitable structure (conceptual model) onto the information.
The area of biodiversity has a very specific problem in this regard. Biological taxonomy is, by
nature, fluid, changing and multiple. Gaps in knowledge, evolution and differences of opinion
as to the classification of species mean that there is no single agreed taxonomy, and inconsistent
scientific nomenclature usage is widely tolerated in the biodiversity literature. The importance
of the nomenclature and taxonomies for accurately communicating biodiversity information,
coupled with the difficulty of modelling such information, means that there are numerous efforts
to create comprehensive ontologies and other knowledge representation resources of taxonomic
and other biodiversity data. However, despite these efforts, many of the resources are still
fragmented and incomplete, and work on the premise of imposing a single, external hierarchy onto
the data mapped.
The literature review has revealed that, despite continued recognition of both the inconsistency and plurality of the scientific nomenclature, and of the importance of a proper understanding of the intended meaning of these terms when used, there has been no systematic empirical analysis of nomenclature usage in the biodiversity literature to profile meaning. My research project has applied a combined design science and corpus lexicographic approach to the problem, based on the “Word Sketch” analysis technique provided by the “Sketch Engine” lexicographic analysis tool. This research study has adapted Word Sketches to define a method by which nomenclature usage can be mapped and compared against ontological or other knowledge representation resource information, and across corpora to check for stability of usage and meaning. The method was first developed and tested with two test corpora (on the subject of freshwater fish) against an authoritative knowledge representation resource and was then evaluated through application to three nomenclature profile studies.
The method developed aims to serve people working in biodiversity by helping them to choose a suitable knowledge resource onto which to map specific bodies of data, to identify issues when integrating data, to identify problems or inconsistencies in data or in knowledge representation resources that need to be reviewed, and to map changes in nomenclature use across language, domain, time, author, publication, etc. It could also be developed into a tool to help novices in taxonomy identify where multiple variants refer to the same species.
Date of Award | 2020 |
---|---|
Original language | English |
Awarding Institution | |
Sponsors | EPSRC |
Supervisor | Roger Evans (Supervisor) & Gulden Uchyigit (Supervisor) |