Clustering is a technique that groups similar objects or entities into a single cluster in such a way that the variance among objects within a group is minimal, while the variance across groups is high. The modus operandi is to identify a set of factors or characteristics common to the data points in the population, and then group the data points based on their similarity on those factors or characteristics.
A key application of clustering in data science is customer segmentation, wherein specific strategies or tactics can be devised for a particular segment of respondents or customers based on important characteristics such as purchasing power, demographics, and preferences.
Why do we need clustering in data science?
The concept of clustering is most prevalent in marketing analytics, where an entire customer base is segmented.
Now, let's take an example: Unilever, one of the leading consumer goods companies in the world, plans to launch a product targeted at baby boomers based in Europe. Being in the B2C business, it is practically impossible to devise a targeting strategy for each individual consumer. This is where clustering comes into play: the company identifies factors common to the baby boomer population and, based on those, clusters consumers with similar characteristics. This helps it scale up its strategy toward a cluster, as opposed to an individual consumer.
What are the top 5 clustering algorithms data scientists should know?
Before getting to the most popular clustering algorithms, it must be noted that clustering is an unsupervised machine learning method and a useful tool for statistical data analysis. A few of the preferred clustering algorithms are explained below for reference –
1. K-means clustering algorithm – This is the most basic clustering algorithm. It begins with a random selection of groups and the assignment of a midpoint (centroid) to each. Data points are then assigned to groups based on their distance from each group's centroid, and the centroids are recomputed iteratively until the assignments stabilize.
While this process is quite basic and simple, the resulting classification retains a degree of randomness from the initial selection.
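The k-means procedure described above can be sketched with scikit-learn; the two-blob dataset and all parameter values here are illustrative assumptions, not from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two well-separated blobs of 2-D points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),  # blob around (0, 0)
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),  # blob around (5, 5)
])

# n_clusters must be fixed in advance; n_init reruns the random
# centroid initialisation several times to soften its randomness
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_            # group assignment per point
centroids = kmeans.cluster_centers_  # the learned midpoints
```

Note the `random_state` argument: without it, repeated runs can converge to different groupings, which is exactly the randomness the text mentions.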
2. Mean-shift clustering algorithm – This is a centroid-based algorithm that aims to locate the midpoint of each group by shifting a window toward the denser regions of the data distribution. The process iterates until the window settles on the point of maximum density, and is then repeated for the other groups. While this algorithm has similarities with k-means, its key advantage is that the number of clusters need not be decided in advance.
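A minimal mean-shift sketch with scikit-learn; the synthetic two-blob data and the `bandwidth` value are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Illustrative data: two dense regions, no cluster count given anywhere
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(30, 2)),
])

# bandwidth sets the size of the sliding window; the number of
# clusters is discovered from the density peaks, not specified
ms = MeanShift(bandwidth=1.0).fit(X)
n_found = len(ms.cluster_centers_)
```

The contrast with k-means is the absence of any `n_clusters` argument: the density peaks alone determine how many groups emerge.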
3. Hierarchical clustering (agglomerative) – This is a tree-like clustering technique in which every data point starts out as its own cluster. In each subsequent iteration, the clusters with the most similar characteristics are merged into a single group, and the hierarchy is refined with every iteration.
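The bottom-up merging can be sketched with scikit-learn's agglomerative implementation; the four sample points and the `ward` linkage choice are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four points: two near the origin, two near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.1]])

# Each point starts as its own cluster; the closest pairs are
# merged repeatedly until only n_clusters groups remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = agg.labels_
```

Stopping the merging at `n_clusters=2` cuts the tree at the level where the two tight pairs have each been merged but not yet joined to each other.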
4. Expectation maximization (using Gaussian mixture models) – The key assumption of this technique is that the data follow a mixture of Gaussian distributions. The operating model is quite similar to k-means: the number of clusters is fixed and their parameters are initialized randomly, after which the probability that each data point belongs to each cluster is computed and the cluster parameters are updated until convergence.
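The soft, probabilistic assignment that distinguishes this method from k-means can be shown with scikit-learn's `GaussianMixture`; the 1-D two-component data here is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data drawn from two well-separated Gaussians
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 1)),
    rng.normal(loc=6.0, scale=0.5, size=(50, 1)),
])

# EM alternates between computing membership probabilities
# (E-step) and updating each Gaussian's parameters (M-step)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # soft membership, one column per cluster
labels = gmm.predict(X)       # hard assignment, argmax of probs
```

Unlike k-means, `predict_proba` exposes the per-point cluster probabilities the text describes, rather than only a hard label.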
5. Spatial clustering (based on density) – In this type of clustering, a starting point is selected at random and the distances to its neighboring points are calculated. If enough neighbors fall within a given radius, a cluster is started and expanded point by point, and it is refined with each iteration; points without enough nearby neighbors are treated as noise.
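This neighbor-counting process can be sketched with scikit-learn's DBSCAN, a standard density-based implementation; the sample points and the `eps`/`min_samples` values are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.2],   # dense region A
    [4.0, 4.0], [4.1, 4.2], [4.2, 4.1],   # dense region B
    [9.0, 9.0],                           # isolated point
])

# eps is the neighborhood radius; min_samples is how many
# neighbors a point needs before a cluster can grow from it
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = db.labels_  # points that never gain enough neighbors get -1
```

The isolated point receives the label `-1`, i.e. noise, which is how density-based clustering handles points that never accumulate enough neighbors.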