Grouping Data Naturally with K-means Clustering

When analyzing large volumes of data, it is often useful to find segments or groups within the data that reveal naturally occurring patterns. In many analyses, users apply segmentation based on business logic to filter and group data, then look at the patterns within each business segment. We see this often when we review revenue by Line of Business, Product, or Territory. What happens when we don’t want to use these groupings and instead turn to naturally occurring groups within our data set? For this, we can look to k-means clustering.

K-means clustering is an algorithm that finds natural groups by assigning each record to the nearest of k cluster centers (the "means") and recomputing those centers until the assignments settle. In insurance, we can use a metric like Total Insured Value (TIV) as the basis for the mean. We also specify the number of groups, k, that we would like the data broken into. The k-means algorithm then works through the dataset based on TIV, finds similar TIV values, and assigns them to a group, creating segments based on the data rather than on business rules. These groups can then be graphed to see how they map onto the different TIV ranges.
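To make the assign-and-recompute loop concrete, here is a minimal sketch in plain Python. The TIV figures are made up for illustration, and the function name `kmeans_1d` and its deterministic starting centroids are assumptions chosen to keep the example reproducible. Library implementations usually start from random or k-means++ centroids and handle many columns at once.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one-dimensional data; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: start each centroid at the midpoint of one of k equal-size
    # slices of the sorted data, so the result is repeatable.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, labels

# Hypothetical TIV figures in millions of dollars (illustrative only).
tiv = [0.8, 1.1, 1.3, 0.9, 5.2, 4.8, 5.5, 6.1, 20.0, 22.5, 19.4]
centroids, labels = kmeans_1d(tiv, k=3)
for value, label in zip(tiv, labels):
    print(f"TIV {value:>5.1f}M -> group {label}")
```

On this sample the three groups line up with the small, mid-size, and large TIV values, without any business rule defining where one band ends and the next begins.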

Another application for k-means clustering is identifying customer buying patterns or lifetime customer sales. Using customer ID and total sales, the amounts can be segmented into different groups to see what patterns emerge for analysis.
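As a rough illustration of the customer case, the sketch below rolls transactions up to one lifetime total per customer ID and then clusters those totals. The transaction data, the `kmeans_1d` helper, and its deterministic starting centroids are all hypothetical; a real analysis would more likely call a library routine.

```python
from collections import defaultdict

def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one-dimensional data; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: repeatable start at the midpoints of k equal-size slices.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Hypothetical (customer_id, sale_amount) transactions, illustrative only.
transactions = [
    ("C001", 120.0), ("C002", 40.0), ("C001", 80.0), ("C003", 950.0),
    ("C004", 35.0), ("C005", 410.0), ("C003", 1100.0), ("C006", 380.0),
    ("C002", 25.0), ("C005", 60.0),
]

# Roll transactions up to one lifetime total per customer.
totals = defaultdict(float)
for customer_id, amount in transactions:
    totals[customer_id] += amount

# Cluster the totals, then map each customer ID to its segment number.
customer_ids = list(totals)
centroids, labels = kmeans_1d([totals[c] for c in customer_ids], k=3)
segments = dict(zip(customer_ids, labels))
for customer_id in customer_ids:
    print(f"{customer_id}: total {totals[customer_id]:>7.2f} -> segment {segments[customer_id]}")
```

The customer ID is only a key for the roll-up here; the clustering itself runs on the sales totals.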

When using k-means clustering, choosing the right number of clusters is also important. Within the k-means tool set, there are methods for estimating a suitable number of clusters from the data, such as the elbow method and silhouette analysis. In some cases, it also makes sense to break the data into a small, fixed number of groups (3, 5, or 7) so the segments are easy to understand and the analysis is not overwhelmed by too many or too few of them.
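One common heuristic for picking the number of clusters is the elbow method: run k-means for several values of k, record the within-cluster sum of squares (WCSS) for each, and look for the point where adding another cluster stops helping much. The sketch below assumes the same made-up TIV figures and a hypothetical `kmeans_1d` helper with deterministic starting centroids.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one-dimensional data; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: repeatable start at the midpoints of k equal-size slices.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def wcss(values, centroids, labels):
    """Within-cluster sum of squared distances to each value's centroid."""
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))

# Hypothetical TIV figures in millions of dollars (illustrative only).
tiv = [0.8, 1.1, 1.3, 0.9, 5.2, 4.8, 5.5, 6.1, 20.0, 22.5, 19.4]

wcss_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans_1d(tiv, k)
    wcss_by_k[k] = wcss(tiv, centroids, labels)
    print(f"k={k}: WCSS={wcss_by_k[k]:.2f}")
```

On this sample the WCSS falls steeply from k=1 to k=3 and then flattens, which suggests three segments. On real data the elbow is often less clear, and the choice usually also weighs how many segments the audience can reasonably work with.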

K-means clustering is available in many common languages and tools, including Python, R, Java, C#, and even Excel. Integrations also exist for BI tools like Qlik Sense and Power BI.

Adding k-means to your analysis can surface relationships in the data that business-defined segments miss, and it shows the value of adding statistical analysis tools to a company's data tool kit. Once you have applied k-means to one part of your data, look for other places where it can be applied, then combine segments in the analysis to see what new patterns emerge. Rinse and repeat.