ABSTRACT

Monitoring systems are becoming increasingly important for analytics over big data, which spans a wide family of application scenarios: from online advertising to financial securities exchange, from social networks to medical information systems. Such a system contains three subsystems: measurement, alerting, and diagnosis. Measurement determines whether the Service Level Agreement (SLA) is achieved, whether Key Performance Indicators (KPIs) are met, and whether system resource usage is within budget; it also covers other internal indicators such as the usage, adoption, coverage, precision, and recall of individual components and prediction models. Alerting reports abnormal system situations such as pipeline or service errors and SLA violations. The purpose of alerting is to shorten the MTTD (mean time to detection). Diagnosis provides a tool to better understand the whole system and find the root cause of a system fault more quickly. The target of diagnosis is to shorten the MTTR (mean time to repair). Without monitoring, system faults are difficult to detect and KPIs are hard to track, hence more effort and time are required to fix a business system issue. In 2006, Khanna [1] developed an external monitor by analyzing external message exchanges. In 2007, Khanna [2] proposed a rule-based diagnosis for distributed IT infrastructures. In 2010, Haifeng [3] proposed an invariants-based failure diagnosis method for distributed computing systems. In 2010, Joshi [4] proposed a probabilistic model-driven recovery for distributed systems. Some individual software packages [5, 6, 7], e.g. Ganglia, Nagios, and Splunk, provide some functionality for monitoring, alerting, and diagnosis of distributed systems such as Hadoop [8]. Due to technical complexity, none of these systems is general purpose for analytics applications over big data.