ABSTRACT
Acronyms
23.1 Introduction
23.2 Categorizing an Anomaly Detection Problem
    23.2.1 Type of Anomaly Detection Problem (Pre-processing)
    23.2.2 Local versus Global Problems
    23.2.3 Availability of Labels
23.3 A Simple Artificial Unsupervised Anomaly Detection Example
23.4 Unsupervised Anomaly Detection Algorithms
    23.4.1 k-NN Global Anomaly Score
    23.4.2 Local Outlier Factor (LOF)
    23.4.3 Connectivity-Based Outlier Factor (COF)
    23.4.4 Influenced Outlierness (INFLO)
    23.4.5 Local Outlier Probability (LoOP)
    23.4.6 Local Correlation Integral (LOCI) and aLOCI
    23.4.7 Cluster-Based Local Outlier Factor (CBLOF)
    23.4.8 Local Density Cluster-Based Outlier Factor (LDCOF)
23.5 An Advanced Unsupervised Anomaly Detection Example
23.6 Semi-supervised Anomaly Detection
    23.6.1 Using a One-Class Support Vector Machine (SVM)
    23.6.2 Clustering and Distance Computations for Detecting Anomalies
23.7 Summary
Glossary
Bibliography
AD - Anomaly Detection
CBLOF - Cluster-based Local Outlier Factor
COF - Connectivity-based Outlier Factor
CSV - Comma-Separated Values
DLP - Data Leakage Prevention
IDS - Intrusion Detection System
INFLO - Influenced Outlierness
LDCOF - Local Density Cluster-based Outlier Factor
LOCI - Local Correlation Integral
LOF - Local Outlier Factor
LoOP - Local Outlier Probability
LRD - Local Reachability Density
NASA - National Aeronautics and Space Administration
NBA - National Basketball Association
NN - Nearest-Neighbor
SVM - Support Vector Machine
Anomaly detection is the process of finding patterns in a given dataset that deviate from the characteristics of the majority. These deviating patterns are also known as anomalies, outliers, intrusions, exceptions, misuses, or fraud. Which name is used usually depends on the application domain, so we use the generic term anomaly in the following. Anomaly detection can be classified as a sub-area of data mining and machine learning. However, the term is not well defined from a mathematical point of view, so the following section gives a more detailed overview. Even if the reader is very familiar with classification and machine learning in general, we recommend reading Section 23.2 completely: anomaly detection is fundamentally different, and a well-working RapidMiner process requires, in almost all cases, a deeper understanding of the nature of the given anomaly detection problem.