LLM-Generated Samples for Android Malware Detection

Research output: Contribution to journal › Article › peer-review

Abstract

Android malware continues to evolve through obfuscation and polymorphism, posing challenges for both signature-based defenses and machine learning models trained on limited and imbalanced datasets. Synthetic data has been proposed as a remedy for scarcity, yet the role of Large Language Models (LLMs) in generating effective malware data for detection tasks remains underexplored. In this study, we fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS, using the KronoDroid dataset. After addressing generation inconsistencies with prompt engineering and post-processing, we evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near-perfect detection, while augmentation with synthetic data preserves high performance with only minor degradations. In contrast, synthetic-only training produces mixed outcomes, with effectiveness varying across malware families and fine-tuning strategies. These findings suggest that LLM-generated tabular malware feature records can enhance scarce datasets without compromising detection accuracy, but remain insufficient as a standalone training source.
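
The three training settings named in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the nearest-centroid classifier, the `make_toy` generator, the two class labels, and every number are hypothetical stand-ins, and no KronoDroid data or LLM output is involved. The one point taken from the abstract is that training uses real data only, real plus synthetic data, or synthetic data alone; scoring every setting on held-out real records is an additional assumption of this sketch.

```python
import random


def nearest_centroid_fit(rows, labels):
    """Return a dict mapping each class label to its mean feature vector."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))


def accuracy(centroids, rows, labels):
    hits = sum(nearest_centroid_predict(centroids, x) == y
               for x, y in zip(rows, labels))
    return hits / len(rows)


def evaluate_settings(real_train, synthetic, real_test):
    """Each argument is a (rows, labels) pair; the test set is always real."""
    settings = {
        "real_only": real_train,
        "real_plus_synthetic": (real_train[0] + synthetic[0],
                                real_train[1] + synthetic[1]),
        "synthetic_only": synthetic,
    }
    return {name: accuracy(nearest_centroid_fit(*train), *real_test)
            for name, train in settings.items()}


def make_toy(n_per_class, noise, seed):
    """Hypothetical toy generator: two Gaussian clusters in four dimensions."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, center in (("benign", 0.0), ("malware", 3.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(center, noise) for _ in range(4)])
            labels.append(label)
    return rows, labels


# Noisier records stand in for generated samples (an assumption of this toy).
real_train = make_toy(50, 1.0, seed=1)
synthetic = make_toy(50, 2.0, seed=2)
real_test = make_toy(50, 1.0, seed=3)

for name, score in evaluate_settings(real_train, synthetic, real_test).items():
    print(f"{name}: accuracy={score:.3f}")
```

Holding the test set fixed across the three runs means any difference in score comes from the training data alone.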
Original language: English
Pages (from-to): 5
Number of pages: 1
Journal: Digital
Volume: 6
Issue number: 1
Publication status: Published - 18 Jan 2026

Keywords

  • Android
  • malware detection
  • large language models
  • LLMs
  • synthetic data
