ABSTRACT

Acronyms
23.1 Introduction
23.2 Categorizing an Anomaly Detection Problem
    23.2.1 Type of Anomaly Detection Problem (Pre-processing)
    23.2.2 Local versus Global Problems
    23.2.3 Availability of Labels
23.3 A Simple Artificial Unsupervised Anomaly Detection Example
23.4 Unsupervised Anomaly Detection Algorithms
    23.4.1 k-NN Global Anomaly Score
    23.4.2 Local Outlier Factor (LOF)
    23.4.3 Connectivity-Based Outlier Factor (COF)
    23.4.4 Influenced Outlierness (INFLO)
    23.4.5 Local Outlier Probability (LoOP)
    23.4.6 Local Correlation Integral (LOCI) and aLOCI
    23.4.7 Cluster-Based Local Outlier Factor (CBLOF)
    23.4.8 Local Density Cluster-Based Outlier Factor (LDCOF)
23.5 An Advanced Unsupervised Anomaly Detection Example
23.6 Semi-supervised Anomaly Detection
    23.6.1 Using a One-Class Support Vector Machine (SVM)
    23.6.2 Clustering and Distance Computations for Detecting Anomalies
23.7 Summary
Glossary
Bibliography

Acronyms

AD - Anomaly Detection

CBLOF - Cluster-based Local Outlier Factor

COF - Connectivity-based Outlier Factor

CSV - Comma-Separated Values

DLP - Data Leakage Prevention


IDS - Intrusion Detection System

INFLO - Influenced Outlierness

LDCOF - Local Density Cluster-based Outlier Factor

LOCI - Local Correlation Integral

LOF - Local Outlier Factor

LoOP - Local Outlier Probability

LRD - Local Reachability Density

NASA - National Aeronautics and Space Administration

NBA - National Basketball Association

NN - Nearest-Neighbor

SVM - Support Vector Machine

Anomaly detection is the process of finding patterns in a given dataset that deviate from the characteristics of the majority. These patterns are also known as anomalies, outliers, intrusions, exceptions, misuses, or fraud. Which name is used usually depends on the application domain, so we use the generic term anomaly in the following. Anomaly detection can be regarded as a sub-area of data mining and machine learning. However, the term is not well defined from a mathematical point of view, which makes it necessary to give a more detailed overview in the following section. Even if the reader is very familiar with classification and machine learning in general, we recommend reading Section 23.2 completely: anomaly detection is fundamentally different, and a well-working RapidMiner process requires, in almost all cases, a deeper understanding of the nature of the given anomaly detection problem.
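
To make the idea of "deviating from the majority" concrete, the following minimal Python sketch scores each record by its mean distance to its k nearest neighbors, in the spirit of the k-NN global anomaly score covered in Section 23.4.1. It is an illustration under stated assumptions, not the RapidMiner operator: the function name `knn_anomaly_scores`, the toy data, and the choice k = 3 are all hypothetical.

```python
import math


def knn_anomaly_scores(points, k=3):
    """Score each point by its mean Euclidean distance to its k nearest neighbors.

    Hypothetical helper for illustration only; a larger score means the point
    lies farther from the bulk of the data.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy data (made up): a tight cluster plus one distant record.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (0.8, 0.9), (5.0, 5.0)]
scores = knn_anomaly_scores(points, k=3)

# Index of the record with the largest score, i.e. the strongest anomaly candidate.
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous], round(scores[most_anomalous], 2))
```

In this toy dataset the distant record receives by far the largest score, while the clustered records all score low. Note that the result is a ranking rather than a yes/no decision; where to draw the line between normal and anomalous depends on the nature of the problem at hand.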