Document Clustering with Evolved Multi-Word Search Queries Where the Number of Classes is Unknown

Laurence Hirsch; Robin Hirsch; Bayode Ogunleye

doi:10.2139/ssrn.4552319

School of Arch, Tech and Eng

Research output: Working paper › Preprint

Abstract

We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of search queries in Apache Lucene format. Each cluster is formed as the set of documents matched by one search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where queries contain more than one word, we have found it useful to assign one word to be the root and to constrain query construction so that the set of documents returned by any additional query word intersects with the set returned by the root word. Multi-word queries are interpreted disjunctively. We also describe how a gene can be used to determine the number of clusters (k). Some documents in a collection are not returned by any of the search queries in a set, so once the search query evolution is completed, a second stage is performed in which a KNN algorithm assigns every unassigned document to its nearest cluster. We describe the method and present results on eight text datasets, comparing its effectiveness with that of well-known existing algorithms. We note that the search query format has the qualitative benefits of being interpretable and of providing an explanation of cluster construction.
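The following is a minimal Python sketch of the pieces named in the abstract, not the authors' implementation. It uses plain sets of terms in place of a Lucene index and omits the genetic search itself. The toy corpus, all function names, the scoring formula (uniquely matched documents minus overlapping ones), the modulo encoding of the k gene, and the use of Jaccard similarity for the KNN stage are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> set of terms.
DOCS = {
    0: {"space", "orbit", "nasa"},
    1: {"space", "launch", "rocket"},
    2: {"hockey", "goal", "ice"},
    3: {"hockey", "team", "season"},
    4: {"orbit", "hockey", "ice"},
    5: {"rocket", "stock"},
}

def match(word, docs):
    """Ids of the documents that contain `word`."""
    return {d for d, terms in docs.items() if word in terms}

def build_query(root, extras, docs):
    """Root-word constraint: keep an extra word only if its hits
    intersect the hits of the root word."""
    root_hits = match(root, docs)
    return [root] + [w for w in extras if match(w, docs) & root_hits]

def run_query(query, docs):
    """Disjunctive (OR) reading: a document matches if it has any query word."""
    hits = set()
    for word in query:
        hits |= match(word, docs)
    return hits

def fitness(queries, docs):
    """Assumed score: reward documents hit by exactly one query,
    penalize documents hit by more than one."""
    counts = Counter(d for q in queries for d in run_query(q, docs))
    unique = sum(1 for c in counts.values() if c == 1)
    overlap = sum(1 for c in counts.values() if c > 1)
    return unique - overlap

def decode_k(genome, k_min=2, k_max=10):
    """Assumed encoding: the first integer gene selects k in [k_min, k_max]."""
    return k_min + genome[0] % (k_max - k_min + 1)

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_unmatched(queries, docs, n_neighbors=3):
    """Second stage: documents hit by exactly one query seed the clusters;
    every other document takes the majority label of its nearest seeds."""
    hits = [run_query(q, docs) for q in queries]
    counts = Counter(d for h in hits for d in h)
    seeds = {d: i for i, h in enumerate(hits) for d in h if counts[d] == 1}
    labels = dict(seeds)
    if not seeds:
        return labels  # nothing to vote with
    for d in sorted(set(docs) - set(seeds)):
        # Sort seeds by descending similarity, breaking ties by document id.
        nearest = sorted(seeds, key=lambda s: (-jaccard(docs[d], docs[s]), s))
        votes = Counter(seeds[s] for s in nearest[:n_neighbors])
        labels[d] = votes.most_common(1)[0][0]
    return labels

queries = [
    build_query("space", ["orbit", "team"], DOCS),   # "team" is dropped
    build_query("hockey", ["team", "nasa"], DOCS),   # "nasa" is dropped
]
score = fitness(queries, DOCS)
clusters = assign_unmatched(queries, DOCS)
```

In this toy run, document 4 is returned by both queries and document 5 by neither, so both are left to the KNN stage. In this sketch, overlapping documents are treated the same way as unmatched ones, which is a simplification rather than something the abstract specifies.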

Original language	English
Number of pages	31
DOIs	https://doi.org/10.2139/ssrn.4552319
Publication status	Published - 25 Aug 2023

Keywords

Document clustering
search query
genetic algorithm
machine learning
Apache Lucene

Cite this

@techreport{122e0bfbfd8e4bdba70ccfcecf2bd364,
  title = "Document Clustering with Evolved Multi-Word Search Queries Where the Number of Classes is Unknown",
  keywords = "Document clustering, search query, genetic algorithm, machine learning, Apache Lucene",
  author = "Laurence Hirsch and Robin Hirsch and Bayode Ogunleye",
  year = "2023",
  month = aug,
  day = "25",
  doi = "10.2139/ssrn.4552319",
  language = "English",
  type = "WorkingPaper",
}
