K-means Clustering and its real use-case in the Security Domain

What is K-means Clustering?

How does k-means clustering work?

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.

Initially, k so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. Here, each initial centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids c_i and c_j with i ≠ j, c_i ≠ c_j). These centroids are used to train a kNN classifier. The resulting classifier is used to classify the data (using k = 1, that is, each point goes to its single nearest centroid) and thereby produce an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of classification and adjustment is repeated until the values of the centroids stabilize. The final centroids are used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.
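To make the classification and adjustment steps concrete, here is a minimal sketch of a single pass in plain Python. The toy data and the helper names nearest_centroid and mean_point are made up for illustration, and Euclidean distance is an assumption; the description above does not fix a distance measure.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (a 1-nearest-neighbour lookup)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean_point(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

# Hypothetical toy data: two visibly separate groups in 2-D.
data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [data[0], data[2]]  # initial centroids are existing data points

# Classification step: label each point with its nearest centroid.
labels = [nearest_centroid(p, centroids) for p in data]

# Adjustment step: move each centroid to the mean of the cluster it defines.
centroids = [
    mean_point([p for p, lab in zip(data, labels) if lab == i])
    for i in range(len(centroids))
]
```

After this one pass, each centroid sits at the mean of the points assigned to it. Repeating the two steps until the centroids stop moving gives the full algorithm.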

What are the basic steps for K-means clustering?

  • Step 1: Choose the number of clusters k.
  • Step 2: Select k random points from the data as centroids.
  • Step 3: Assign all the points to the closest cluster centroid.
  • Step 4: Re-compute the centroids of newly formed clusters.
  • Step 5: Repeat steps 3 and 4 until the centroids stop changing.
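The five steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation: the function name kmeans, the max_iter safety cap, the fixed seed, and the use of Euclidean distance are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(sorted(set(points)), k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Step 5: stop once assignments (and hence centroids) no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Step 1: choose k. The toy data below is made up for illustration.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(data, k=2)
```

On this toy data the three points near (1, 1) end up sharing one label and the three points near (8, 8) the other. Which group receives label 0 depends on the random initial centroids.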

Use-Cases in the Security Domain

Malware Detection

Identifying crime localities

Insurance fraud detection

Cyber-profiling criminals

Call record detail analysis

Automatic clustering of IT alerts

Crime document classification

THANK YOU
