Document Clustering with Evolved Multi-Word Search Queries Where the Number of Classes is Unknown

Laurence Hirsch, Robin Hirsch, Bayode Ogunleye

Research output: Working paper › Preprint

Abstract

We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of search queries in Apache Lucene format. Each cluster is formed as the set of documents matched by one search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where queries contain more than one word, we have found it useful to assign one word to be the root and to constrain query construction so that the set of documents returned by any additional query word intersects with the set returned by the root word. Multi-word queries are interpreted disjunctively. We also describe how a gene can be used to determine the number of clusters (k). Not every document in a collection is returned by one of the search queries in a set, so once search query evolution is complete, a second stage applies a k-nearest neighbours (KNN) algorithm to assign each unassigned document to its nearest cluster. We describe the method and present results on 8 text datasets, comparing effectiveness with well-known existing algorithms. We note that the search query format has the qualitative benefits of being interpretable and of providing an explanation of cluster construction.
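The query-scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: documents are modelled as plain sets of tokens rather than a Lucene index, the function names `matches` and `fitness` are hypothetical, and the score (documents matched by exactly one query minus documents matched by more than one) is an assumed stand-in for the fitness function, which the abstract does not specify. The root-word constraint is approximated here by ignoring any additional word whose document set does not intersect the root word's set.

```python
from collections import Counter


def matches(query, docs):
    """Return indices of documents matched by a disjunctive (OR) query.

    query[0] is treated as the root word. An additional word contributes
    its documents only if at least one of them also contains the root
    (an assumed reading of the root-word constraint).
    """
    root, *extra = query
    root_hits = {i for i, doc in enumerate(docs) if root in doc}
    hits = set(root_hits)
    for word in extra:
        word_hits = {i for i, doc in enumerate(docs) if word in doc}
        if word_hits & root_hits:
            hits |= word_hits
    return hits


def fitness(queries, docs):
    """Assumed score: reward documents hit by one query, penalize overlap."""
    counts = Counter(i for q in queries for i in matches(q, docs))
    unique = sum(1 for c in counts.values() if c == 1)
    overlap = sum(1 for c in counts.values() if c > 1)
    return unique - overlap


# Toy collection: each document is a set of tokens.
docs = [
    {"oil", "price"},
    {"oil", "barrel"},
    {"wheat", "grain"},
    {"grain", "crop"},
    {"oil", "wheat"},
]
queries = [("oil", "barrel"), ("wheat", "grain")]
print(fitness(queries, docs))
```

In this toy run each query defines one cluster, and the document containing both "oil" and "wheat" is the only overlap. A genetic algorithm would evolve the word choices (and, per the abstract, a gene controlling the number of queries) to raise such a score.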
Original language: English
Number of pages: 31
Publication status: Published - 25 Aug 2023

Keywords

  • Document clustering
  • search query
  • genetic algorithm
  • machine learning
  • Apache Lucene
