Role of data integration, logging and machine learning in the realm of cyber security

This news back in October last year triggered a train of thought about the critical role logging, data integration and machine learning can play in building next-generation proactive safeguards to enhance cyber security.


Let’s consider GitHub as a test case. GitHub is a web-based Git repository hosting service, which offers the distributed revision control and source code management (SCM) functionality of Git. It also provides access control and collaboration features such as wikis, task management, bug tracking and feature requests for public as well as private repositories. GitHub faces several cybersecurity challenges, such as: a) denial-of-service attacks against hosted web applications; b) code injection; c) cross-site scripting; d) security breaches due to cookie or session hijacking; e) man-in-the-middle attacks.

Critical team skills

Given the extensive and ever-evolving security challenges companies face today, an information security team needs skills and technical expertise that enable it to respond to incidents, perform analysis tasks, and communicate effectively with its constituency and other external contacts. Team members should have experience with and understanding of multiple security platforms, such as automated and manual testing tools, firewalls, proxy servers, intrusion prevention systems, logging correlation/management, operating systems, protocols, risk assessments and web application firewalls. A real-time analytics platform for unstructured log files (terabytes to petabytes of logs) helps with the logging correlation and management aspects of security. Machine learning algorithms can be used to build predictive models that find patterns and detect anomalies in logs, enabling security teams to take preventive and corrective actions.
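As a minimal sketch of anomaly detection over logs, the example below buckets log events into fixed time windows and flags windows whose event count deviates sharply from the mean (a z-score test). The timestamp-first log format, window size and threshold are all illustrative assumptions; a production system would use a real log parser and a trained model rather than a simple statistical cutoff.

```python
import math
from collections import Counter

def anomaly_scores(counts):
    """Return a z-score for each window's event count."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    std = math.sqrt(var) or 1.0
    return [(c - mean) / std for c in counts]

def flag_anomalies(log_lines, window_size=60, threshold=3.0):
    """Bucket log timestamps into fixed windows and flag bursts.

    Assumes each log line starts with an integer epoch timestamp;
    real logs would need a proper parser.
    """
    windows = Counter()
    for line in log_lines:
        ts = int(line.split()[0])
        windows[ts // window_size] += 1
    keys = sorted(windows)
    counts = [windows[k] for k in keys]
    scores = anomaly_scores(counts)
    return [k * window_size for k, s in zip(keys, scores) if s > threshold]
```

A flagged window is a candidate for the preventive and corrective actions described above, not a confirmed attack.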


Sources of telemetry such as IPFIX can be enabled on infrastructure devices. The device caches and generates records about network traffic and its characteristics. It can report on traffic details at various OSI layers. For example, it can report traffic by source and destination IP address or by transport-layer source and destination port number, or it can extract parts of the TCP header. After the information is exported from the network device, it can be stored and used to perform correlation and analysis.
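Once flow records are exported and stored, a typical first analysis step is aggregation. The sketch below sums bytes per (source, destination, port) key and ranks the heaviest flows; the dict-based record format and field names are illustrative assumptions, since real IPFIX records use numeric information elements.

```python
from collections import defaultdict

def aggregate_flows(records):
    """Aggregate IPFIX-style flow records by a (src, dst, dport) key.

    Each record is a dict with hypothetical field names; a real
    collector would decode IPFIX information elements instead.
    """
    totals = defaultdict(int)
    for r in records:
        totals[(r["src"], r["dst"], r["dport"])] += r["bytes"]
    return dict(totals)

def top_talkers(records, n=3):
    """Return the n flow keys with the most bytes, for correlation/analysis."""
    totals = aggregate_flows(records)
    return sorted(totals, key=totals.get, reverse=True)[:n]
```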

Network device logs are also useful in certain situations. For example, attempts to compromise an infrastructure device’s management credentials may generate log messages that reveal the suspicious activity. Network taps and packet captures are another source: deep packet inspection from taps in the network is useful when investigating end-host compromises. Indicators of compromise can be investigated in historical packet captures, assuming they are stored for a long enough duration and the analysis tools offer the necessary analytical functionality.

Signature-based vs anomaly-based security monitoring

A signature-based IDS searches for a known identity, or signature, for each specific intrusion event. Signature monitoring is very efficient at sniffing out known attack signatures but depends on regular updates to its signature database to keep pace with variations in hacker techniques. It becomes inefficient as the signature database grows in size and complexity: checking every signature requires more CPU cycles and also increases the possibility of false positives.
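A toy version of signature matching makes the scaling problem concrete: every payload is checked against every entry in the database, so cost and false-positive surface grow with the database. The signature names and patterns below are hypothetical examples, not a real rule set.

```python
import re

# Hypothetical signature database: name -> compiled pattern.
SIGNATURES = {
    "sql_injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./\.\./"),
    "xss_script_tag": re.compile(r"<script\b", re.IGNORECASE),
}

def match_signatures(payload):
    """Return the names of all signatures the payload matches.

    Every payload is tested against every signature, which is why
    cost and false positives grow as the database grows.
    """
    return [name for name, pat in SIGNATURES.items() if pat.search(payload)]
```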

An anomaly-detection-based IDS captures the headers of the IP packets entering the network. From this, it filters out all known and legal traffic, including web traffic to the organization’s web server, mail traffic to and from its mail server, outgoing web traffic from company employees, and DNS traffic to and from its DNS server. This helps detect any traffic that is new or unusual. It is particularly good at identifying the sweeps and probes against network hardware that precede an attack, giving early warning of potential intrusions. Anomaly detection requires continuous, uninterrupted, timely collection, management and analysis of huge amounts of sensor data with full data integrity. It also needs supervised and unsupervised statistical algorithms deployed to train on and detect anomalies in network traffic in real time.
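The filtering step described above can be sketched as a baseline of known-legal (host, port) pairs; anything not in the baseline is surfaced as new or unusual. The baseline entries and flow field names here are illustrative assumptions, standing in for a baseline learned from real traffic.

```python
# Hypothetical baseline of known, legal traffic: (dst_host, dst_port) pairs,
# e.g. the organization's web, mail and DNS servers.
BASELINE = {
    ("web.example.com", 443),
    ("mail.example.com", 25),
    ("dns.example.com", 53),
}

def unusual_flows(flows, baseline=BASELINE):
    """Filter out traffic matching the known-legal baseline; what
    remains is new or unusual and worth an analyst's attention."""
    return [f for f in flows if (f["dst"], f["dport"]) not in baseline]
```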

Data distribution service

A data distribution service (DDS) or a message broker such as Kafka is deployed to keep telemetry flows directed and in sync, and to ensure timely delivery and message integrity. It implements a publish/subscribe model for sending and receiving data, events, and commands among the nodes. Nodes that produce messages (publishers) create and publish “topics.” DDS delivers the messages from topics to subscribers that declare an interest in that topic.

DDS handles the transfer chores: message addressing, data marshalling and unmarshalling (so subscribers can run on different platforms from the publisher), delivery, flow control, retries, and so on.
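The publish/subscribe model itself is simple; the sketch below is a toy in-process broker showing topics, publishers and subscribers. It is an illustration of the model only; a real DDS or Kafka deployment layers the marshalling, delivery guarantees, flow control and retries mentioned above on top of this idea, across processes and machines.

```python
from collections import defaultdict

class Broker:
    """Toy in-process publish/subscribe broker illustrating the model."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Declare an interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)
```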

Data enrichment

Once we have access to that data, some forms of data enrichment can be beneficial in adding context to a security event. Low false-alarm rates are critical in anomaly detection and desirable in data cleaning. False alarms are generated in anomaly detection systems because not all anomalies are representative of attacks. Purging such anomalies (program faults, system crashes, among others) is hence justifiable, but within reasonable limits. For example, we can use motif extraction and translation to flag system calls, using a translation table to associate motif occurrence with the probability of attack.
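A common, concrete form of enrichment is joining a raw event against an asset inventory so downstream triage knows who owns the host and how critical it is. The inventory contents and field names below are hypothetical assumptions for illustration.

```python
# Hypothetical asset inventory used to add context to raw events.
ASSETS = {
    "10.0.0.5": {"owner": "payments-team", "criticality": "high"},
}

def enrich(event, assets=ASSETS):
    """Attach asset context to a security event.

    Unknown hosts get a default record so downstream correlation
    never has to handle missing keys.
    """
    context = assets.get(event["src"], {"owner": "unknown", "criticality": "low"})
    return {**event, **context}
```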

Data correlation

Monitoring gains value as alerts are correlated from multiple sources of telemetry. But how do we approach this kind of correlation, both in thought process and in technology, when handling terabytes of log data per day?

Real-time collection, aggregation, correlation, detection and communication is the ideal solution. Kafka brokers can send multiple types of sensor messages to a Storm topology that performs correlation and aggregation. Spark-managed algorithm implementations can then generate accurate, reliable threat-information events for an end user’s dashboard, prompting further action to mitigate the security risk.
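At its core, cross-source correlation joins events on a shared key within a time window. The sketch below pairs IDS alerts with authentication failures from the same source IP within a configurable window; the event shapes and the 300-second default are illustrative assumptions, and a Storm or Spark job would do the same join in a streaming, partitioned fashion rather than with nested loops.

```python
def correlate(ids_alerts, auth_failures, window=300):
    """Pair IDS alerts with auth failures from the same source IP
    that occur within `window` seconds of each other.

    Events are dicts with illustrative "src" and "ts" (epoch seconds)
    fields; a streaming engine would key and window this instead.
    """
    matches = []
    for alert in ids_alerts:
        for failure in auth_failures:
            if (alert["src"] == failure["src"]
                    and abs(alert["ts"] - failure["ts"]) <= window):
                matches.append((alert, failure))
    return matches
```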

The thought process behind alert correlation from multiple sources:

Association – Associating multiple event types and sources across multiple nodes. Frequently, event data from multiple sources and nodes is necessary to identify a problem. The correlation engine needs to be able to process data regardless of its origin.

Event sequence  – The current course of action may be influenced by past events. For example, a single port scan by a particular source or network may not be interesting, but comparing that event to short- and long-term histories may unveil a pattern of behavior that requires immediate action.

Event persistence  – For example, short bursts of high load network traffic may be normal, but sustained bursts could indicate a denial of service attack is underway. The ability to link event persistence with periods of time is a critical need of a correlation engine.
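The event-persistence point above can be sketched as a small detector that only fires when load stays above a threshold for several consecutive samples, so a single spike is ignored but a sustained burst (a possible denial of service) is reported. The threshold and persistence values are illustrative assumptions.

```python
from collections import deque

class BurstDetector:
    """Flag load that stays above `threshold` for `persistence`
    consecutive samples; single spikes are ignored."""

    def __init__(self, threshold, persistence):
        self.threshold = threshold
        # Sliding window of the most recent load samples.
        self.recent = deque(maxlen=persistence)

    def observe(self, load):
        """Record one sample; return True if the burst is sustained."""
        self.recent.append(load)
        return (len(self.recent) == self.recent.maxlen
                and all(sample > self.threshold for sample in self.recent))
```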

Event-directed data collection

As part of correlation, various conditions may require interactions with other systems to complete the process. For example, asset databases, customer databases, network device data or other agent data may be required. The best correlation solutions go beyond simple security data at run time in order to help diagnose, distinguish and deliver meaningful high-priority alerts.

Finally, true correlation is the ability to analyze, compare and match escalated sensor events from multiple sensors across multiple timeframes. Aggregation is an essential prerequisite for effective, cross-platform, real-time correlation.

Identification & alerting

In the end, based on high-signal alerts, how do we decide whether a correlated alert needs immediate attention (i.e. goes to the security pager) or can wait for longer analysis (like a weekly email wrap-up)? This requires threat modeling and the design of a threat or attack tree to decide where each correlated alert should be sent. Examine the network environment from an attacker’s perspective to determine which targets would be most tempting to a person attempting to gain access and what conditions must be met for an attack on those targets to succeed. Once vulnerable targets of opportunity have been identified, the environment can be examined to determine how existing safeguards affect the attack conditions. This process reveals the relevant threats, which can then be ranked according to the level of risk they present, which remediation activities can deliver the most value against each threat, and whether mitigation may affect other areas in beneficial or detrimental ways that change the value of that remediation.
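The routing decision above can be sketched as a simple risk score derived from the threat ranking. The likelihood/impact scales and pager threshold below are illustrative assumptions, not a prescriptive scoring scheme; in practice the score would come from the attack-tree analysis.

```python
def route_alert(alert, pager_threshold=15):
    """Route a correlated alert by a simple risk score.

    `likelihood` and `impact` are assumed 1-5 ratings produced by the
    threat-modeling exercise; risk = likelihood x impact. Alerts at or
    above the threshold page security; the rest go to a weekly digest.
    """
    risk = alert["likelihood"] * alert["impact"]
    return "pager" if risk >= pager_threshold else "weekly_digest"
```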


© Copyright 2017 Topmist, inc. All rights reserved.
