School of Science, Engineering and Information Technology

IEEE ICDM 2018 Tutorial: Which Outlier Detector should I use?

Abstract

This tutorial has four aims:

  1. Presenting current comparative studies of outlier detectors, and analysing the strengths and weaknesses of these studies and their recommendations.
  2. Presenting non-obvious applications of outlier detectors. This provides examples of how outlier detectors are used in areas which are not normally considered to be the domains of outlier detection.
  3. Inviting the research community to explore future research directions, in terms of both comparative study and outlier detection in general.
  4. Giving advice on the factors to consider when choosing an outlier detector, and on the strengths and weaknesses of some "top" recommended algorithms, based on the current understanding in the literature.

Introduction

Outlier detection is one of the key data mining tasks and a highly active area of research. Although a number of 'new' outlier detectors have been proposed, the evaluations provided in the individual proposals may be limited or biased in some ways, as comparative studies attest. With so many choices available, it is unclear which outlier detector should be used in a particular application. This tutorial aims to guide the audience in selecting an outlier detector for their applications. It presents recent comparative studies of state-of-the-art methods and their recommendations, analyses the strengths and limitations of these studies, and describes examples of non-obvious applications of outlier detectors.

Outline of the tutorial

  • Introduction: categories of outlier detectors
  • Current comparative works and their recommendations
  • An analysis of strengths and weaknesses of current comparative works
  • Current bias-variance analyses applied to outlier detection
  • High-dimensional datasets and subspace outlier detection
  • Non-obvious applications of outlier detectors
  • Potential future research directions
  • Factors to consider in choosing an outlier detector

Target audience and prerequisites

Researchers in industry, research students and academics interested in using outlier detectors in applications and in future research on outlier detection.

A basic understanding of data mining and machine learning is required.

Presenters

1. Prof Kai Ming Ting

After receiving his PhD from the University of Sydney, Kai Ming Ting worked at the University of Waikato, Deakin University and Monash University before joining Federation University. He has served as a program committee co-chair for PAKDD 2008 and as a program committee member for a number of conferences, including KDD and ICML. He has received research funding from the Australian Research Council, the US Air Force Office of Scientific Research, the Toyota InfoTechnology Center, and the Australian Institute of Sport. His awards include the Runner-up Best Paper Award at IEEE ICDM 2008 and the Best Paper Award at PAKDD 2006. He is the creator of isolation techniques, mass-based similarity and the isolation kernel.

2. Dr Sunil Aryal

Sunil Aryal is a lecturer at Federation University Australia. He received his Master's and PhD degrees from Monash University, Australia. His research is in the areas of data mining and machine learning, particularly similarity learning, ensemble-based and random methods, and scale-invariant methods, with applications in classification, clustering and anomaly detection. He has published in top-tier data mining and machine learning venues such as ICDM, PAKDD, Machine Learning, and Knowledge and Information Systems. Prior to joining academia, he worked in the IT industry as a software developer and data engineer for a number of years.

3. Prof Takashi Washio

Takashi Washio is a full professor at Osaka University, and serves as a director of the AI Cooperative Research Laboratory, National Institute of Advanced Industrial Science and Technology, NEC Corporation (NEC-AIST). His department at Osaka University focuses on basic studies of machine learning and data mining and is a leading research group in Japan. His NEC-AIST cooperative laboratory is oriented towards applications of machine learning, data mining and simulation techniques to various scientific, industrial and social fields. His current main research interests are, on the basic side, machine learning principles for high-dimensional big data and, on the applied side, machine learning techniques for advanced scientific sensing.

Key references

  1. Emmott, A. F., Das, S., Dietterich, T. G., Fern, A., Wong, W.-K. (2013). Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description. 16-21.
  2. Emmott, A. F., Das, S., Dietterich, T. G., Fern, A., Wong, W.-K. (2016). A Meta-Analysis of the Anomaly Detection Problem. arXiv:1503.01158
  3. Aggarwal, C. C. (2017) Outlier Analysis. Second edition. Springer International Publishing.
  4. Aggarwal, C. C., Sathe, S. (2017) Outlier Ensembles: An Introduction. Springer International Publishing.
  5. Ting, K. M., Washio, T., Wells, J. R., Aryal, S. (2016). Defying the Gravity of Learning Curve: A Characteristic of Nearest Neighbour Anomaly Detectors. Machine Learning. Vol 106, Issue 1, 55-91.
  6. Pimentel, M. A. F., Clifton, D. A., Clifton, L., Tarassenko, L. (2014) A review of novelty detection, Signal Processing, Vol. 99, 215-249.
  7. Aggarwal, C. C., Sathe, S. (2015) Theoretical foundations and algorithms for outlier ensembles. SIGKDD Explorations 17(1):24-47
  8. Sugiyama, M., Borgwardt, K. (2013) Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems 26, 467-475
  9. Zimek, A., Gaudet, M., Campello, R. J., Sander, J. (2013) Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 428-436
  10. Campos, G. O., Zimek, A., Sander, J., Campello, R. J. G. B, Micenkova, B., Schubert, E., Assent, I., Houle, M. E. (2016) On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Mining and Knowledge Discovery, 30(4), 891–927.
  11. Mu, X., Ting, K. M., Zhou, Z. H. (2017). Classification under Streaming Emerging New Classes: A Solution using Completely-random Trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 29, Issue 8, 1605-1618
  12. Chandola, V., Banerjee, A., Kumar, V. (2009) Anomaly detection: A survey. ACM Computing Surveys 41, 3, Article 15 (July 2009), 58 pages.
  13. Nguyen, X. V., Chan, J., Romano, S., Bailey, J., Leckie, C., Ramamohanarao, K., Pei, J. (2016). Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery. Volume 30, Issue 6, 1520–1555.
  14. Rayana, S., Zhong, W., Akoglu, L. (2016) Sequential Ensemble Learning for Outlier Detection: A Bias-Variance Perspective. In: Proceedings of IEEE ICDM Conference
  15. Sathe, S., Aggarwal, C. (2016) Subspace Outlier Detection in Linear Time with Randomized Hashing. In: Proceedings of ICDM Conference.
  16. Hodge, V. J., Austin, J. (2004) A survey of outlier detection methodologies. Artificial Intelligence Review. 85-126.
  17. Pang, G., Cao, L., Chen, L., Liu, H. (2018). Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection, In: Proceedings of the 24th SIGKDD Conference on Knowledge Discovery and Data Mining
  18. Pang, G., Cao, L., Chen, L., Lian, D., Liu, H. (2018) Sparse Modeling-based Sequential Ensemble Learning for Effective Outlier Detection in High-dimensional Numeric Data, In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, US.
  19. Zimek, A., Schubert, E., Kriegel, H.-P. (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, vol. 5 (5) 363-387.
  20. Ting, K. M., Zhu, Y., Zhou, Z. H. (2018). Isolation Kernel and Its Effect on SVM. Proceedings of The ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Materials

Available soon