Sketch Engine: Lexicography using computer-based statistical analysis

  • Evans, Roger (PI)
  • Kilgarriff, Adam (PI)
  • Tugwell, D. (CoI)

Project Details

Description

To produce a dictionary, you need a large collection of language data that shows how a word is used, how often it appears and where. Dictionary-makers draw on a large repository of sentences and literature known as a corpus, which contains millions of words. For example, the British National Corpus offers 100 million words drawn from literature and spoken conversation, providing a valuable picture of how contemporary language is used.

Starting in the 1990s with research into the enhancement of online lexical resources, Dr Roger Evans and Dr Adam Kilgarriff developed a new approach to lexicography that uses computer-based statistical analysis of the behaviour of individual words in large bodies of online text.

When the research began, the researchers were thinking about how to build resources for computerised language processing systems. This required a lot of information about how words behave, so they started looking at dictionaries. What they found proved more interesting and challenging, so the project evolved into supporting the dictionary-making process rather than drawing on existing dictionaries. There was a definite progression from building computer systems to creating tools for dictionary production.

The key innovation of the project was a new method for creating word sense profiles, or word sketches, capturing the detailed behaviour of individual words from large collections of text. Using these word sketches, they created a computational lexicography tool, which was commercialised as the ‘Sketch Engine’ by Lexical Computing Ltd. The company was set up in 2003 by Kilgarriff and Pavel Rychlý, a researcher in text processing tools at Masaryk University in Brno, Czech Republic, at a time when dictionary publishers were beginning to look at moving online.
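As a rough illustration of what a statistical profile of a word's behaviour can look like, the toy Python sketch below ranks the words that directly follow a target word in a tiny hand-made corpus. Everything in it is an assumption made for illustration: the function name `toy_word_sketch`, the four example sentences, the use of simple adjacent word pairs in place of grammatical relations, and the choice of the logDice association score. The project description does not say which statistic or which relations are used, so this should not be read as the Sketch Engine's actual method.

```python
import math
from collections import Counter


def toy_word_sketch(sentences, target, top_n=3):
    """Rank words that appear immediately after `target` by logDice score.

    Hypothetical helper for illustration only: it looks at adjacent word
    pairs, not at grammatical relations in a tagged or parsed corpus.
    """
    word_freq = Counter()   # how often each word occurs
    pair_freq = Counter()   # how often each (word, next word) pair occurs
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_freq.update(tokens)
        pair_freq.update(zip(tokens, tokens[1:]))

    scores = {}
    for (left, right), f_xy in pair_freq.items():
        if left != target:
            continue
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y))
        dice = 2 * f_xy / (word_freq[left] + word_freq[right])
        scores[right] = 14 + math.log2(dice)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Invented example corpus (an assumption, not data from the project).
corpus = [
    "strong tea is served hot",
    "she drinks strong tea daily",
    "a strong wind blew",
    "the tea was cold",
]

sketch = toy_word_sketch(corpus, "strong")
for word, score in sketch:
    print(f"strong + {word}: {score:.2f}")
```

In this toy corpus "tea" ranks above "wind" as a companion of "strong" because the pair occurs twice. A real corpus of millions of words would give far more reliable rankings.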

Key findings

The Sketch Engine has been adopted by four of the UK’s five major dictionary publishers. Lexical Computing Ltd is working with Oxford University Press to analyse children’s language and with Cambridge University Press to analyse the language produced by learners of English. National language institutes in nine European countries and 200 universities worldwide use it to support language research, dictionary production, language technology products and language teaching. It has allowed users to access information on between 30 million and 70 billion words in 61 different languages. Lexical Computing Ltd now employs staff in the UK and the Czech Republic, along with freelancers in a number of other countries. Half of the company’s business is overseas and it runs training courses around the world.

The Sketch Engine has also been used to substantiate arguments in a pervasive debate about language use in the art world. A 2010 analysis of exhibition announcements, which used the Sketch Engine’s search tool, was published in the US art journal Triple Canopy and sparked an international debate on the language of art. The article has since become a widely circulated piece of online cultural criticism, prompting further debates on other forums, including WordPress, Tumblr, Google+, Ikono, Artblog and Artsia.

"Ultimately, the Sketch Engine allows us to create from the Collins Corpus a true picture of language as it is currently used and gives us empirical evidence on which to base our content. This allows us to claim with confidence that our language reference products are based on language as it is really used and so are the most authoritative available." 
David Wark, Senior Publishing Systems and Data Developer, HarperCollins

Status: Finished
Effective start/end date: 1/01/98 to 31/12/14

Keywords

  • Lexicography
  • Natural language
