Document Clustering with Evolved Multi-Word Search Queries

Laurence Hirsch, Robin Hirsh, Bayode Ogunleye

doi:10.2139/ssrn.4749012

Research output: Working paper › Preprint

Abstract

Text clustering is valuable across many domains because it identifies patterns and groups related information. Current approaches rely heavily on a computed similarity measure between documents. We present a novel approach based on a set of evolved search queries that are generated without reference to the distance between documents. Each cluster is formed as the set of documents matched by a single search query from the set. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where a query contains more than one word, we have found it useful to designate one word as the root and to constrain query construction so that the set of documents returned by each additional query word intersects the set returned by the root word. Multi-word queries are interpreted disjunctively. We also describe how a gene can be used to determine the number of clusters (k). Some documents in a collection are not returned by any of the search queries in a set. We therefore implement a second stage, run once query evolution is complete, in which a k-nearest-neighbours (KNN) algorithm assigns each unassigned document to its nearest cluster. We describe the method and present results on eight text datasets, comparing its effectiveness with that of well-known existing algorithms. As well as achieving the highest accuracy on these datasets, the search query format offers the qualitative benefits of being interpretable and modifiable while providing a causal explanation of cluster construction.
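
The query-as-cluster idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the function names (`hits`, `query_hits`, `respects_root`, `fitness`) and the exact scoring formula (documents matched by exactly one query minus documents matched by several) are assumptions made for illustration only. The evolutionary search, the gene that selects k, and the KNN second stage are not shown.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is reduced to its set of words.
docs = {
    0: {"engine", "car", "wheel"},
    1: {"car", "road", "driver"},
    2: {"orbit", "rocket", "launch"},
    3: {"rocket", "engine", "fuel"},
    4: {"goal", "match", "team"},
}

def hits(word):
    """IDs of the documents that contain a single word."""
    return {doc_id for doc_id, words in docs.items() if word in words}

def query_hits(query):
    """Disjunctive (OR) reading: a document matches if it contains any query word."""
    matched = set()
    for word in query:
        matched |= hits(word)
    return matched

def respects_root(query):
    """Root constraint: each additional word must share a document with the root (first) word."""
    root_docs = hits(query[0])
    return all(hits(word) & root_docs for word in query[1:])

def fitness(queries):
    """Assumed score: reward documents matched by one query, penalise overlap."""
    counts = Counter(doc_id for q in queries for doc_id in query_hits(q))
    unique = sum(1 for c in counts.values() if c == 1)
    overlap = sum(1 for c in counts.values() if c > 1)
    return unique - overlap

if __name__ == "__main__":
    clean = [["car", "road"], ["rocket", "orbit"]]
    leaky = [["car", "engine"], ["rocket"]]
    print(fitness(clean), fitness(leaky))
    print(respects_root(["car", "road"]), respects_root(["car", "goal"]))
```

In this toy example document 4 is matched by neither query set, which corresponds to the unassigned documents that the abstract hands to the KNN stage.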

Original language	English
Number of pages	32
DOIs	https://doi.org/10.2139/ssrn.4749012
Publication status	Published - 5 Mar 2024

Access to Document

10.2139/ssrn.4749012 (Licence: CC BY)

SSRN-id4749012 (Accepted author manuscript, 413 KB; Licence: CC BY)

Cite this

@techreport{6fe76d4fc0984b74a4639f99acdf8706,
  title = "Document Clustering with Evolved Multi-Word Search Queries",
  author = "Laurence Hirsch and Robin Hirsh and Bayode Ogunleye",
  year = "2024",
  month = mar,
  day = "5",
  doi = "10.2139/ssrn.4749012",
  language = "English",
  type = "WorkingPaper",
}

TY  - UNPB
T1  - Document Clustering with Evolved Multi-Word Search Queries
AU  - Hirsch, Laurence
AU  - Hirsh, Robin
AU  - Ogunleye, Bayode
PY  - 2024/3/5
Y1  - 2024/3/5
U2  - 10.2139/ssrn.4749012
DO  - 10.2139/ssrn.4749012
M3  - Preprint
BT  - Document Clustering with Evolved Multi-Word Search Queries
ER  -
