Separation of Benign and Malicious Network Events for Accurate Malware Family Classification
Labeling malware samples with their appropriate malware family helps understand and track malware evolution and develop mitigation techniques. Current malware analysis techniques that use supervised machine learning rely on classification models that are trained on malware traffic generated from a sandbox environment. These models are then used to classify future unseen observations. In practice, however, malware traffic comes mixed with other legitimate "background" traffic from host machines, such as user browsing and software update traffic. Hence, the classifier's accuracy to predict the correct malware label on unseen (mixed) traffic is low. We propose a novel classification system that uses an Independent Component Analysis (ICA) module that applies distribution decomposition to separate the observed traffic into two components, malware traffic and background traffic. We also use a random forest classifier module to learn a classification model for every malware family, and then use it to predict malware family labels using the output of the ICA module. This system is thus capable of labeling malware traffic after removing background artifacts ("noise"), which makes it more efficient and accurate than current classification methods. Our experiments on three malware family datasets show that the performance of our system improves significantly after removing the background traffic artifacts.