Corpus-based methods are now dominant in Natural Language Processing (NLP).
Creating large corpora is no longer difficult, and the technology for analyzing them is
becoming faster, more robust and more accurate. However, when an NLP application
performs well on one corpus, it is unclear whether this level of performance would
be maintained on others. To make progress on this question, we need methods
for comparing corpora. This thesis investigates comparison methods based on the
notions of corpus homogeneity and similarity.
Date of Award: Jul 2005