Understanding Anomaly Detection Techniques for Symbolic Sequences

Date of Submission: 
January 5, 2009
Report Number: 
Report PDF: 
We present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data. The specific contributions of this paper are as follows: (i). This evaluation facilitates understanding of the relative strengths and weaknesses of different techniques. Through careful experimentation, we illustrate that the performance of different techniques is dependent on the nature of sequences, and the nature of anomalies in the sequences. No one technique outperforms all others. For most techniques we also identify some data sets on which they perform very well, and some on which they perform poorly. (ii). We investigate variants that have not been tried before. For example, we evaluate a k-nearest neighbor based technique that performs better than a clustering based technique that was proposed for sequences. Also, we propose FSA-z, a variant of an existing Finite State Automaton (FSA) based technique, which performs consistently superior to the original FSA based technique. (iii). We propose a novel way of generating artificial sequence data sets to evaluate anomaly detection techniques. (iv). We characterize the nature of normal and anomalous test sequences, and associate the performance of each technique to one or more of such characteristics.