An empirical study of problems and evaluation of IoT malware classification label sources

Tianwei Lei; Jingfeng Xue; Yong Wang; Thar Baker; Zequn Niu

doi:10.1016/j.jksuci.2023.101898

School of Arch, Tech and Eng

Research output: Contribution to journal › Article › peer-review

Abstract

With the proliferation of malware on IoT devices, research on IoT malicious code has also become more mature. Most studies use learning models to detect or classify malware. Therefore, ensuring high-quality labels for malware samples is crucial to maintaining research accuracy. Researchers typically submit malware samples to Anti-Virus (AV) engines to obtain labels, but different engines have varying rules for detecting maliciousness and variants. This study aims to improve future IoT malware research accuracy by investigating label quality. We address three label-related issues, including Anti-Virus detection technology, naming rules, and label expiration. Additionally, we examine multiple sources of malware labels, including 63 studies on IoT, Windows, and Android malware, as well as commonly used tools such as AVClass and Anti-Virus engines. To evaluate and recommend label sources, we construct classification models using an IoT malware dataset obtained from VirusShare, which is labeled with common tools and Anti-Virus engines and classified based on ELF features. For family classification, we investigate four methods and tools, and for variant hierarchy classification, we compare label overlaps with sample clustering from 12 Anti-Virus engines. Based on our findings, we recommend AVClass for obtaining labels for IoT family classification. For small-scale malware families at the variant level, we recommend using the labels from Ad-Aware, BitDefender, and Emsisoft engines. For large-scale malware families, we advise employing labels from Jiangmin, NANO-Antivirus, and Avira engines, serving as a valuable guide for future IoT malware research.

Original language	English
Article number	101898
Journal	Journal of King Saud University - Computer and Information Sciences
Volume	36
Issue number	1
DOIs	https://doi.org/10.1016/j.jksuci.2023.101898
Publication status	Published - 28 Dec 2023

Access to Document

10.1016/j.jksuci.2023.101898

Licence: CC BY-NC-ND

Cite this

@article{f4705c8fb588459380690c28d1464c96,

title = "An empirical study of problems and evaluation of IoT malware classification label sources",

abstract = "With the proliferation of malware on IoT devices, research on IoT malicious code has also become more mature. Most studies use learning models to detect or classify malware. Therefore, ensuring high-quality labels for malware samples is crucial to maintaining research accuracy. Researchers typically submit malware samples to Anti-Virus (AV) engines to obtain labels, but different engines have varying rules for detecting maliciousness and variants. This study aims to improve future IoT malware research accuracy by investigating label quality. We address three label-related issues, including Anti-Virus detection technology, naming rules, and label expiration. Additionally, we examine multiple sources of malware labels, including 63 studies on IoT, Windows, and Android malware, as well as commonly used tools such as AVClass and Anti-Virus engines. To evaluate and recommend label sources, we construct classification models using an IoT malware dataset obtained from VirusShare, which is labeled with common tools and Anti-Virus engines and classified based on ELF features. For family classification, we investigate four methods and tools, and for variant hierarchy classification, we compare label overlaps with sample clustering from 12 Anti-Virus engines. Based on our findings, we recommend AVClass for obtaining labels for IoT family classification. For small-scale malware families at the variant level, we recommend using the labels from Ad-Aware, BitDefender, and Emsisoft engines. For large-scale malware families, we advise employing labels from Jiangmin, NANO-Antivirus, and Avira engines, serving as a valuable guide for future IoT malware research.",

author = "Tianwei Lei and Jingfeng Xue and Yong Wang and Thar Baker and Zequn Niu",

year = "2023",

month = dec,

day = "28",

doi = "10.1016/j.jksuci.2023.101898",

language = "English",

volume = "36",

journal = "Journal of King Saud University - Computer and Information Sciences",

issn = "1319-1578",

publisher = "Elsevier",

number = "1",

}

TY - JOUR

T1 - An empirical study of problems and evaluation of IoT malware classification label sources

AU - Lei, Tianwei

AU - Xue, Jingfeng

AU - Wang, Yong

AU - Baker, Thar

AU - Niu, Zequn

PY - 2023/12/28

Y1 - 2023/12/28

AB - With the proliferation of malware on IoT devices, research on IoT malicious code has also become more mature. Most studies use learning models to detect or classify malware. Therefore, ensuring high-quality labels for malware samples is crucial to maintaining research accuracy. Researchers typically submit malware samples to Anti-Virus (AV) engines to obtain labels, but different engines have varying rules for detecting maliciousness and variants. This study aims to improve future IoT malware research accuracy by investigating label quality. We address three label-related issues, including Anti-Virus detection technology, naming rules, and label expiration. Additionally, we examine multiple sources of malware labels, including 63 studies on IoT, Windows, and Android malware, as well as commonly used tools such as AVClass and Anti-Virus engines. To evaluate and recommend label sources, we construct classification models using an IoT malware dataset obtained from VirusShare, which is labeled with common tools and Anti-Virus engines and classified based on ELF features. For family classification, we investigate four methods and tools, and for variant hierarchy classification, we compare label overlaps with sample clustering from 12 Anti-Virus engines. Based on our findings, we recommend AVClass for obtaining labels for IoT family classification. For small-scale malware families at the variant level, we recommend using the labels from Ad-Aware, BitDefender, and Emsisoft engines. For large-scale malware families, we advise employing labels from Jiangmin, NANO-Antivirus, and Avira engines, serving as a valuable guide for future IoT malware research.

U2 - 10.1016/j.jksuci.2023.101898

DO - 10.1016/j.jksuci.2023.101898

M3 - Article

SN - 1319-1578

VL - 36

JO - Journal of King Saud University - Computer and Information Sciences

JF - Journal of King Saud University - Computer and Information Sciences

IS - 1

M1 - 101898

ER -
