The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms used for constructing them may become computationally expensive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and Regressed Data Summarisation Framework (RDSF). The former is an acceleration procedure based on a simple theoretical observation on using Uniform Random Sampling for clustering problems. The latter is a coreset-based data-summarising framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating summaries of data using RDSF is Ordinary Least Squares (OLS).
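The general idea of clustering a uniform random sample instead of the full input can be illustrated with a minimal sketch. This is not the authors' ACvS implementation: the function names (`cluster_via_uniform_sample`, `lloyd_kmeans`, `clustering_cost`), the use of Lloyd's k-means as the clustering routine, the sample size, and the synthetic two-blob data are all assumptions made for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centres):
    """Index of the centre closest to point p."""
    return min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))


def lloyd_kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Recompute each centre as its group's mean; keep the old centre
        # if a group happens to be empty.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
    return centres


def clustering_cost(points, centres):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centres) for p in points)


def cluster_via_uniform_sample(points, k, sample_size, seed=0):
    """Cluster a uniform random sample rather than the full dataset."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return lloyd_kmeans(sample, k, rng)


# Hypothetical data: two well-separated Gaussian blobs in the plane.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(1000)]

# Centres are computed from 100 sampled points, then evaluated on all 2000.
centres = cluster_via_uniform_sample(data, k=2, sample_size=100)
print(clustering_cost(data, centres))
```

In this sketch the clustering routine only ever sees the sample, so its running time depends on the sample size rather than on the size of the full dataset; the full data is touched only when evaluating the cost.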
|Title of host publication||Data Management Technologies and Applications - 9th International Conference, DATA 2020, Revised Selected Papers|
|Editors||Slimane Hammoudi, Christoph Quix, Jorge Bernardino|
|Place of Publication||Switzerland|
|Number of pages||28|
|Publication status||Published - 23 Jul 2021|
|Name||Communications in Computer and Information Science|
Funding Information: This research is supported by AstraZeneca and the Paraguayan Government.
- Logistic Regression
- Data compression