Abstract
Text clustering is valuable across many domains because it identifies patterns and groups related information. Current approaches rely heavily on a computed similarity measure between documents. We present a novel approach based on a set of evolved search queries, which are generated without reference to the distance between documents. Each cluster is formed as the set of documents matched by a single search query in the set. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where queries contain more than one word, we have found it useful to designate one word as the root and to constrain query construction so that the set of documents returned by each additional query word intersects the document set returned by the root word. Multiword queries are interpreted disjunctively. We also describe how a gene can be used to determine the number of clusters (k). Some documents in a collection may not be returned by any of the search queries in a set. We therefore implement a second stage, once the search query evolution is complete, in which a k-nearest neighbours (KNN) algorithm assigns every unassigned document to its nearest cluster. We describe the method and present results on 8 text datasets, comparing its effectiveness with well-known existing algorithms. We note that, as well as achieving the highest accuracy on these datasets, the search query format provides the qualitative benefits of being interpretable and modifiable whilst providing a causal explanation of cluster construction.
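The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the evolutionary search itself is omitted, and the function names, the toy documents, the exact fitness formula (uniquely matched documents minus overlapping ones), the use of Jaccard similarity for the KNN stage, and the treatment of multiply matched documents as unassigned are all assumptions made for illustration.

```python
from collections import Counter

def match(query, docs):
    """Indices of docs matched by a disjunctive (OR) query: any query word present."""
    return {i for i, d in enumerate(docs) if any(w in d for w in query)}

def root_constraint_ok(query, docs):
    """The first word is taken as the root; every additional word's document
    set must intersect the root word's document set."""
    root_hits = match(query[:1], docs)
    return all(match([w], docs) & root_hits for w in query[1:])

def fitness(queries, docs):
    """Assumed fitness: reward docs hit by exactly one query, penalise overlap."""
    counts = Counter(i for q in queries for i in match(q, docs))
    unique = sum(1 for c in counts.values() if c == 1)
    overlap = sum(1 for c in counts.values() if c > 1)
    return unique - overlap

def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_clusters(queries, docs, k=3):
    """Stage 1: label docs matched by exactly one query.
    Stage 2: give every remaining doc the majority label of its k nearest labelled docs."""
    hits = [match(q, docs) for q in queries]
    seeds = {}
    for i in range(len(docs)):
        owners = [c for c, h in enumerate(hits) if i in h]
        if len(owners) == 1:  # assumption: multiply matched docs are left to stage 2
            seeds[i] = owners[0]
    if not seeds:
        raise ValueError("no document is matched by exactly one query")
    labels = dict(seeds)
    for i in range(len(docs)):
        if i in labels:
            continue
        nearest = sorted(seeds, key=lambda j: jaccard(docs[i], docs[j]), reverse=True)[:k]
        votes = Counter(seeds[j] for j in nearest)
        labels[i] = votes.most_common(1)[0][0]
    return labels

# Toy collection: each document is a set of tokens (hypothetical data).
docs = [
    {"stock", "market", "shares"},
    {"market", "trading", "bank"},
    {"football", "goal", "league"},
    {"league", "match", "goal"},
    {"bank", "shares", "profit"},     # matched by no query
    {"match", "goal", "referee"},     # matched by no query
]
queries = [["market", "stock"], ["league", "football"]]  # root word listed first

print(fitness(queries, docs))
print(assign_clusters(queries, docs, k=3))
```

In this sketch each query is a list whose first element is the root word, so a candidate such as `["market", "football"]` would be rejected by `root_constraint_ok` because the two words never match a common document. An evolutionary algorithm would generate and score many such query sets with `fitness` before the KNN stage runs on the best one.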
Original language | English |
---|---|
Number of pages | 32 |
Publication status | Published - 5 Mar 2024 |