Use Case of K-means Clustering in the Cyber Security Domain
What is Clustering?
Clustering is a method of unsupervised machine learning that identifies and groups similar data points in larger datasets. In simple words, the aim is to separate data points with similar traits and assign them to clusters. Clustering algorithms interpret only the input data, without labels, and find natural groups or clusters in feature space.
What is K-means Clustering?
K-means clustering is an unsupervised machine learning algorithm for clustering 'n' observations into 'k' clusters, where k is a predefined or user-defined constant. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. … Data points are clustered based on feature similarity.
The way the K-means algorithm works is as follows:
- Specify the number of clusters K.
- Initialize the centroids by randomly selecting K data points.
- Assign each data point to its closest centroid.
- Recalculate the position of each of the K centroids as the mean of the data points assigned to it.
- Repeat steps 3 and 4 until the centroids no longer move.
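The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (made up): two visibly separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary and depend on the random initialization. What matters is which points share a label.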
Use Case in the Cyber Security Domain
(a) Cyber Profiling
The idea of cyber profiling is derived from criminal profiling, which gives the investigation division information for classifying the types of criminals who were at the crime scene. Profiling is based more specifically on what is known and not known about the criminal.
The cyber profiling process can be directed toward:
- Identification of the users of computers that have been used previously.
- Mapping of the subject's family, social life, work, or network-based organizations, including those for whom he/she worked.
- Provision of information about the user's ability, level of threat, and vulnerability to threats.
- Identification of the suspected abuser.
In a broader scope, cyber profiling can provide supporting information in a case, for example in counterintelligence and counterterrorism.
The new approach to cyber profiling is to use clustering techniques to classify Web-based content according to user preference data. These preferences can be interpreted as an initial grouping of the data, so that the resulting clusters represent user profiles.
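To make the idea concrete, the sketch below turns visit records into per-user preference vectors, which is the kind of numeric input a clustering algorithm such as K-means expects. The record layout, the user and category names, and the helper `preference_vectors` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (user, content category) visit records; names are illustrative only.
visits = [
    ("user_a", "news"), ("user_a", "news"), ("user_a", "sports"),
    ("user_b", "shopping"), ("user_b", "shopping"), ("user_b", "news"),
    ("user_c", "sports"), ("user_c", "sports"),
]


def preference_vectors(records):
    """Return (categories, {user: vector}), where each vector holds the share
    of that user's visits that fall in each category."""
    categories = sorted({category for _, category in records})
    counts = defaultdict(Counter)
    for user, category in records:
        counts[user][category] += 1
    vectors = {}
    for user, counter in counts.items():
        total = sum(counter.values())
        # A Counter returns 0 for categories the user never visited.
        vectors[user] = tuple(counter[c] / total for c in categories)
    return categories, vectors


categories, vectors = preference_vectors(visits)
for user, vector in sorted(vectors.items()):
    print(user, dict(zip(categories, vector)))
```

Using shares instead of raw counts is a design choice for this sketch: it keeps very active users from dominating the distance calculation, so users with similar tastes end up close together regardless of how much they browse.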
A log is a file that records events in a computer program. More generally, a log is a record of daily activities. A record of activities captured directly as they happen is called a transaction log. Log files can be used to support the cyber forensics process by providing digital evidence during the investigation stage.
Preprocessing is performed to remove duplicate data, check the data for inconsistencies, and correct errors in the data, such as typographical errors.
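A minimal sketch of such a preprocessing pass is shown below. The log layout (timestamp, user, category), the sample entries, and the `CORRECTIONS` table are assumptions made for the example. A real log would need its own parsing and validation rules.

```python
# Hypothetical raw log entries: (timestamp, user, category). Values are made up.
raw_entries = [
    ("2024-01-01 10:00", "user_a", "News"),
    ("2024-01-01 10:00", "user_a", "News"),       # exact duplicate
    ("2024-01-01 10:05", "user_b", " shoping "),  # typo and stray whitespace
    ("2024-01-01 10:07", "", "sports"),           # inconsistent: missing user
]

# Assumed correction table for known misspellings.
CORRECTIONS = {"shoping": "shopping"}


def preprocess(entries):
    """Normalize, correct, validate, and deduplicate log entries, keeping order."""
    seen = set()
    cleaned = []
    for timestamp, user, category in entries:
        # Normalize case and whitespace, then fix known typographical errors.
        category = category.strip().lower()
        category = CORRECTIONS.get(category, category)
        user = user.strip()
        # Drop inconsistent records that are missing a field.
        if not (timestamp and user and category):
            continue
        record = (timestamp, user, category)
        # Drop duplicates of records already kept.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


cleaned = preprocess(raw_entries)
for record in cleaned:
    print(record)
```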
Thank You :)