|Title||Scalable analysis of large-scale system logs for anomaly detection|
|Project(s)||Data-Driven Software Engineering Department|
|Publication Type||PhD Thesis|
|Year of Publication||2019|
|Degree awarding institution||Ozyegin University|
System logs provide information regarding the status of system components and the events that occur at runtime. This information can support fault detection, diagnosis, and prediction. However, analyzing and interpreting a huge volume of log data that does not always conform to a standardized structure is challenging. As scale increases, distributed systems generate logs comprising a huge volume of messages from several components. It thus becomes infeasible to monitor systems and detect anomalies efficiently and effectively with manual or traditional analysis techniques. Several studies aim to detect system anomalies automatically by applying machine learning techniques to system logs. However, they offer limited efficiency and scalability. We identified three shortcomings that cause these limitations: i) Existing log parsing techniques do not parse unstructured log messages in a parallel and distributed manner. ii) Log data is processed mainly offline rather than online; that is, the entire log data is collected beforehand instead of being analyzed piece by piece as more data becomes available. iii) Existing studies employ centralized implementations of machine learning algorithms. In this dissertation, we address these shortcomings to facilitate end-to-end scalable analysis of large-scale system logs for anomaly detection. We introduce a framework for distributed analysis of unstructured log messages. We evaluated our framework with two sets of log messages obtained from real systems. Results showed that our framework achieves a performance improvement of more than 30% on average compared to baseline approaches that do not employ fully distributed processing. In addition, it maintains the same level of accuracy as that obtained in benchmark studies even though, unlike those studies, it does not require the availability of the source code.
Our framework also enables online processing, where log data is processed progressively in successive time windows. The benefit of this approach is that some anomalies can be detected earlier. The risk is that accuracy might suffer. Experimental results showed that this happens rarely, only when a window boundary cuts across a session of events. On the other hand, the average anomaly detection time is reduced significantly. Finally, we introduce a case study that evaluates distributed implementations of the PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both with respect to each other and with respect to their centralized implementations. Results showed that the distributed versions can achieve the same accuracy while improving performance by orders of magnitude compared to their centralized versions. PCA turns out to perform better than K-means, although we observed that the difference between the two tends to decrease as the degree of parallelism increases.
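
To illustrate the windowed online processing described above, the following minimal Python sketch groups log events into successive fixed-width time windows and reports the sessions that a window boundary cuts across, which is the situation in which accuracy might suffer. It is not the framework's implementation: the event tuples, the window width, and the function names `tumbling_windows` and `split_sessions` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def tumbling_windows(events, width):
    """Group (timestamp, session_id, message) tuples into fixed-width windows."""
    windows = defaultdict(list)
    for ts, session, message in events:
        # Integer division assigns each event to exactly one window.
        windows[ts // width].append((ts, session, message))
    return [windows[key] for key in sorted(windows)]


def split_sessions(windows):
    """Return the ids of sessions whose events fall into more than one window."""
    seen = defaultdict(set)
    for index, window in enumerate(windows):
        for _, session, _ in window:
            seen[session].add(index)
    return sorted(s for s, indexes in seen.items() if len(indexes) > 1)


# Hypothetical events: (timestamp in seconds, session id, message).
events = [
    (1, "s1", "open"), (4, "s1", "close"),
    (8, "s2", "open"), (12, "s2", "close"),
    (15, "s3", "open"), (18, "s3", "close"),
]

windows = tumbling_windows(events, width=10)
print(len(windows))             # number of non-empty windows
print(split_sessions(windows))  # sessions cut by a window boundary
```

With a window width of 10 seconds, session `s2` starts in the first window and ends in the second, so an analysis of either window alone sees only part of that session.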