Image Recognition Research: Cluster Analysis

In the previous chapter of this research we faced a situation where some of the pictures that should have been marked irrelevant could not be sorted out based on our initial likelihood/distance measures. This was caused by the general profile of images in the category: most of the pictures had abstract features that our chosen service recognized as closer to irrelevant ones than to actual spare parts. Let's consider how we can deal with such situations.

If we look at the results of the failed "hand brake" category, we'll spot one important property of the irrelevant images that escaped detection: the measures of most of them lie really close to each other; in fact, sometimes they share the same score. This can be explained by the nature of our input data: it often contains strings of highly similar or even identical photos. Therefore, if a category contains at least one irrelevant picture, there is a high chance it contains more of them as well. This brings us to the task of cluster analysis: grouping a set of objects in such a way that objects in the same group (cluster) are similar to each other.

Strategies of cluster analysis 

This research was largely based on the article Cluster Analysis in R - Unsupervised machine learning. It contains a brief overview of the most popular methods of cluster analysis, along with code samples in R. Below we'll highlight the most valuable parts of the article.

So, there are two basic strategies of cluster analysis: partitioning and hierarchical. Partitioning methods return a vector containing the cluster index of every observation. Hierarchical methods return a tree-based representation of the observations, which is called a dendrogram.
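
To make the distinction concrete, here is a minimal sketch in R (assuming `features` is a numeric matrix of image feature vectors; the name is ours, not part of the original setup):

    # Partitioning: kmeans() returns a vector with one cluster
    # index per observation.
    km <- kmeans(features, centers = 5)
    head(km$cluster)

    # Hierarchical: hclust() returns a tree that plot() renders
    # as a dendrogram.
    hc <- hclust(dist(features))
    plot(hc)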

In our research hierarchical methods proved inefficient: due to the large amount of data in the sample (5000 observations), the dendrograms became too complicated and unreadable. That's why we decided to stay with partitioning clustering.

Partitioning methods

Here are the three best-known partitioning methods:

  1. K-means clustering, in which each cluster is represented by the center of the data points belonging to it.
  2. PAM (Partitioning Around Medoids), in which each cluster is represented by one of the objects in the cluster.
  3. CLARA (Clustering Large Applications), a variant of PAM for analyzing large data sets.

Our initial aim was to determine whether such methods could sort irrelevant images into separate clusters at all. We used the most basic method, k-means, to split the test set from the first chapter into a given number of subsets.

With 20 clusters, the irrelevant pictures were almost entirely located within two subsets, which contained only two additional valid pictures. This suggests that, with a proper choice of the number of clusters, we can identify suspicious clusters to be reviewed by hand later.
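
A hedged sketch of how such a check might look in R, reusing the hypothetical `features` matrix from above and assuming `irrelevant_idx` holds the row indices of the known irrelevant pictures (both names are ours):

    set.seed(42)   # k-means depends on random starts
    km <- kmeans(features, centers = 20, nstart = 25)

    # See which clusters the known irrelevant pictures fall into;
    # those clusters become candidates for manual review.
    table(km$cluster[irrelevant_idx])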

Determining number of clusters

There are plenty of methods for choosing the right number of clusters for a partition. They introduce a definite measure of partition quality - elbow, silhouette or gap statistic. Overall, their strategy is to run the actual clustering for every candidate number in a given range and apply the measure to each result. The upper bound of that range still has to be decided. In theory, we could set it to the number of observations, but such an expensive computation would last forever. In practice, for categories with 5000 photos we set the possible maximum of clusters to 50. Below we can see example output for the "locking system" category, suggesting that we use 14 clusters.
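
Such a plot can be produced, for instance, with the factoextra package; a sketch under the same hypothetical `features` matrix:

    library(factoextra)

    # Runs k-means for k = 2..50 and plots the average silhouette
    # width for each k; the peak marks the suggested number.
    fviz_nbclust(features, kmeans, method = "silhouette", k.max = 50)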

As for the choice of measure, R's NbClust package can determine the number of clusters using a combination of measures. This doesn't affect running time significantly, since computing a measure is a much less computationally expensive task than the clustering itself. More complex measures do exist, but NbClust suggests omitting them for performance reasons.

In most cases the number of clusters suggested by the majority of measures coincided with the one chosen by the average silhouette measure, so we used the latter as our primary measure.
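
A sketch of the NbClust majority vote (index = "all" runs the standard set of measures while leaving out the computationally expensive ones, which only index = "alllong" includes):

    library(NbClust)

    res <- NbClust(data = features, distance = "euclidean",
                   min.nc = 2, max.nc = 50,
                   method = "kmeans", index = "all")

    # One suggested number of clusters per measure; the mode of
    # this table is the majority vote.
    table(res$Best.nc["Number_clusters", ])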

Choice of partitioning method

Actually, there is an implementation of the PAM algorithm in R that doesn't require determining the number of clusters beforehand. On real data, however, this feature gave us little benefit, since the algorithm kept dividing the images into just two parts. This can be explained by the distribution of the data, which can be illustrated with the cluster plot of the "hand brake" images.
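
The implementation isn't named here, but it is presumably pamk from the fpc package, which picks the number of clusters by optimizing the average silhouette width; a minimal sketch under that assumption:

    library(fpc)

    # pamk() tries every k in krange and keeps the one with the
    # best average silhouette width, so no k is fixed in advance.
    # (On our data it kept settling on 2 clusters.)
    res <- pamk(features, krange = 2:50, criterion = "asw")
    res$nc   # the chosen number of clusters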

Traditional sample data for cluster analysis usually feature dots located in distinct regions of the plot; however, this is not our case. The data is distributed evenly throughout the plot (with a slightly denser concentration of dots in the lower left quarter). This is probably why the automatic suggestion of the number of clusters worked poorly.

The reason behind the final choice of partitioning method was, once again, performance. The last of the three discussed algorithms, CLARA, is well optimized for working with large data sets: it randomly splits the data into multiple subsets and computes PAM on each of them.

Another extremely useful feature of CLARA is sampling: after clustering, it can suggest, for each cluster, the samples that represent its data best.
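
A sketch of CLARA from R's cluster package (k = 14 echoes the "locking system" suggestion above; `features` and `image_paths` are hypothetical names):

    library(cluster)

    cl <- clara(features, k = 14, samples = 50)

    # i.med holds the row indices of the medoids: the one object
    # per cluster that represents its data best.
    review <- image_paths[cl$i.med]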

Thanks to this feature, we built an app that generated a page of sample pictures with their corresponding clusters. We could then easily scroll through the samples, and if we found an irrelevant picture, all the data in its cluster was marked suspicious.

Clustering results

The approach described above was applied to the categories that weren't successfully parsed with our primary models. (Note that in practice the relevancy of clusters was checked by reviewing the samples from the CLARA algorithm.) We should admit that on this data cluster analysis gave us limited results. For most categories, whenever we looked at the sample list, all the suggested pictures were valid. It seems these categories were, indeed, clean.

A positive result was that we managed to extend the list of invalid pictures in the "hand brake" category, as they appeared in the respective cluster.

Search by checksum

Together, the initial research and the cluster analysis allowed us to compose a list of 991 records that could be regarded as irrelevant. But that wasn't all.

The original dump contains around 3 million unique images, but, as stated before, many of them have the same content. Therefore, a simple checksum comparison against known irrelevant files could give us a more general result.
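
A minimal sketch of such a comparison in base R, assuming the dump sits on disk and the known irrelevant hashes are kept in a text file (both paths are hypothetical):

    # MD5 hash of every image in the dump.
    files  <- list.files("dump/", recursive = TRUE, full.names = TRUE)
    hashes <- tools::md5sum(files)

    # Flag every file whose checksum matches a known irrelevant one.
    known_bad  <- readLines("irrelevant_hashes.txt")
    suspicious <- names(hashes)[hashes %in% known_bad]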

From the irrelevant records we derived 52 MD5 hashes and searched for equal hashes throughout the dump. After several iterations the number of hashes grew: the lists of irrelevant pictures, which usually go in succession, had gaps in some places. We rechecked these gaps and, as a result, added a few more hashes. We also added some manually found images and finally restarted the search with a list of 76 hashes of irrelevant pictures.

Final results

Overall, after the search by checksum we got a total of 14,546 irrelevant pictures, which is 0.46% of the 3,153,557 total unique URLs. This percentage subjectively correlates with how often one encounters irrelevant pictures on the site, so we can suggest that even if we ran the research on the full data (during this research we partially parsed 13 out of 184 categories), despite its high cost, we would not achieve a significant improvement in results.