Unsupervised Learning

In contrast to supervised learning, unsupervised learning deals with inputs only: D=\{\mathbf{x}_i\}^N_{i=1}. Such a dataset is called an unlabelled dataset. Because there are no outputs to predict, the aim of unsupervised learning is to uncover the latent structure in the data. For example, how many groups (classes) can we form from the data? How can we represent the data effectively? The most common task in unsupervised learning is clustering, a method that automatically organizes data into groups (clusters) based on their similarity. In other words, we wish to learn the underlying structure of the data without its labels (outputs). Figure 1 (on the left) shows a distribution of data without labels. One possible way to cluster the data is to divide it into two groups, as shown in the figure on the right. Alternatively, we can divide the same data into four clusters, as shown in Figure 2. You might ask which clustering is correct. There is no definitive answer, because the data is unlabelled and there is no ground truth to compare against. Judging the quality of a clustering is therefore subjective and domain-specific.

Figure 1: Clustering into two clusters
Figure 2: Clustering into four clusters
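
To make the two-versus-four question concrete, here is a minimal sketch in plain Python. It uses k-means, which the text above does not name; it is chosen here only as one familiar clustering algorithm. The toy dataset, the helper names (sq_dist, nearest, kmeans) and the farthest-first initialisation are all assumptions made for illustration. The same unlabelled points are clustered once with k=2 and once with k=4, and both runs return a perfectly reasonable grouping.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(p, centers):
    """Index of the centre closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=100):
    """Minimal k-means sketch; returns (labels, centers)."""
    # Assumption: deterministic farthest-first initialisation, so the
    # example gives the same result on every run.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))

    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [nearest(p, centers) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep the centre of an empty cluster
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers

    return [nearest(p, centers) for p in points], centers


# Hypothetical unlabelled data: four tight groups of nine points each,
# arranged so that they also form two wider groups (left and right).
offsets = [-0.1, 0.0, 0.1]
blobs = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx in offsets for dy in offsets]

labels2, centers2 = kmeans(points, k=2)
labels4, centers4 = kmeans(points, k=4)

print("k=2 cluster sizes:", [labels2.count(j) for j in range(2)])
print("k=4 cluster sizes:", [labels4.count(j) for j in range(4)])
```

Nothing in the data says whether k=2 or k=4 is the right choice. The number of clusters is an input to the algorithm, and which result is more useful depends on the application.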
