Hunting with Artificial Intelligence: Detection of malicious domains (II)

This post and the full series has been elaborated jointly with Ana Isabel Prieto, Sergio Villanueva and Luis Búrdalo.

In the previous article we commented on the difficulty faced by Threat Hunting analysts as a result of the high number of domains registered daily by an organization. This makes it difficult to analyze and locate potentially malicious domains, which may go unnoticed among so much traffic. For this reason, in an attempt to facilitate the analyst’s task, the use of alternative techniques based on Machine Learning is proposed. Before presenting the different tests performed, the article introduces the algorithms to be used for the detection of anomalies in the domains.

To begin with, it is necessary to comment that having a large and varied database is fundamental for a model to be able to detect potentially malicious domains reliably, since its parameters are going to be adjusted in an environment that must be similar to the real one.

However, there is great difficulty in identifying patterns in high-dimensional data, and even more difficulty in representing such data graphically and expressing them in a way that highlights their similarities and differences. This is where the need arises to use a powerful data analysis tool such as PCA (Principal Components Analysis).

The main objective of PCA is to transform a large set of variables into a smaller one that contains most of the original information. To do this, new dummy variables (principal components) are created in the directions of maximum variance from the linear combination of the original variables, so that each principal component contains the weights of each of the original characteristics. In addition, these new variables are not correlated among themselves, and are calculated according to the amount of variability they explain, so that the first principal component is the one that explains the most variability in the original data.

Figure 1: Percentage of variability explained by each of the principal components in an example dataset. As can be seen, the first 2 principal components alone are sufficient to explain almost 60% of the variability in the data [1].
Figure 2: The PCA tries to find the directions of maximum variability in the data and projects them into a new subspace of lower dimension than the original [1].

Moreover, since each of the principal components is a linear combination of the original variables, this allows us to know which original variables are more important in each principal component and, therefore, to identify which variable or variables have contributed to the fact that, for example, a certain domain is different from the rest [2].

To reduce the dimensionality of the problem, a PCA is applied to the original database of domains with their extracted lexical features, in order to extract as many principal components (PC) as necessary to explain 95% of the variability of the data. These new variables are then used by the unsupervised classification models to label each domain as “normal” or “anomalous”.

The unsupervised algorithms chosen in this case for the detection of anomalous domains have been two: Isolation Forest and One-Class SVM.

Isolation Forest

The first of the algorithms considered in this study is Isolation Forest [3]. This algorithm would fall within the family of decision trees, since it combines the performance of a set of “if-else” classifiers. By means of random cuts on the variables of the dataset, these classifiers isolate the different observations of the dataset, assigning to each one the average number of cuts necessary for its total isolation. Isolation Forest is based on the premise that those observations with more peculiar characteristics will be isolated more quickly from the rest of the data, thus requiring fewer partitions for their isolation [3].

Figure 3: Graphical representation of the observations that the Isolation Forest algorithm would classify as anomalies. These anomalous observations coincide with those that are more isolated from the rest [4].

One Class SVM

The second algorithm considered in this study is One Class SVM [5]. It is an extension of the classical support vector machines for the case of unlabeled data. Here, instead of separating two classes by means of a hyperplane in the feature space, what is attempted is to frame the data sample within a minimum volume hypersphere, so that the outermost points that are left out are considered outliers [5].

Figure 4: Graphical representation of the classification performed by the One-class SVM algorithm according to the type of defined boundary, in this case, spherical [6].

What is considered anomalous?

It is important to note that these algorithms classify as anomalous those elements that are significantly different from the rest, which is not always because they are malicious. For example, if the activity of different users within an organization were monitored, the behavior of system administrators might appear to be anomalous simply because they perform specific tasks that are very different from those of other employees.

On the other hand, it should also not be assumed that malicious activity is significantly less frequent than lawful activity. For example, in the case of a DoS attack, malicious activity would be much greater than lawful activity, so the algorithms used in this series of articles may not be able to detect such activity.

In the next post we will present the tests and results obtained by applying the algorithms already discussed, as well as some details about the database we have worked with.


See also in: