Image Recognition Research: Mathematical Models

Now that we've filtered out which APIs suit our research, it's time to define a measure of how much one image stands apart from all others in its category. This will let us make a final choice between our two finalists: IBM Watson and Clarifai.

Likelihood model

As we discussed earlier, for a given image a recognition service returns a list of tuples, each holding a matched tag and its probability. Let's denote a tag/probability tuple as {t_i, p_i} (since p_i is a probability, 0 ≤ p_i ≤ 1). This allows us to gather the tags of all images in a category into one big category model, where each tag is paired with the sum of its probabilities over all images: {t_i, l_i} = {t_i, Σp_i}. Here is a JSON-style example of such a model, constructed from a small number of belt pulley images recognized with Clarifai.

  round: 15.435690880000001,
  desktop: 11.130878169999999,
  'round out': 9.58189604,
  spare: 14.69702532,
  technology: 16.45498455,
  closeup: 2.7754252,
  steel: 17.15680852,
  roller: 6.70406938,
  single: 4.70725738,
  close: 0.93656075,
  equipment: 18.41126968,
  ring: 1.8943614,
  isolated: 17.399775979999998,
  stranded: 4.76837015,
  part: 7.601697509999999,
  car: 6.56234279,
  machine: 8.49300154,
  'no person': 8.35393987,
  aluminum: 10.60585878,
  disjunct: 7.27045048,
  glazed: 10.38314444,
  wheel: 5.80297674,
  mechanism: 1.8856983600000001,
  metallic: 7.648597210000001,
  design: 2.62207212,
  old: 0.8699759,

What's the mathematical interpretation of these numbers? If we divide such a value by the total number of images n in the category, we get the average probability that this tag matches an image from this category. Let's call such values the likelihood.

Now we can take any parsed image, multiply the probability p_i that a tag describes the image by the category value l_i for that tag, sum these products, and divide by the number of images n, which turns each l_i into a likelihood: m = (Σ p_i·l_i)/n. The larger m is, the more likely the image belongs to this category. Let's call this value the likelihood measurement.
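As a minimal sketch of the two steps above, the following Python builds a category model and computes the likelihood measurement. The function names and the sample tag data are hypothetical and made up for illustration; they are not real Clarifai output, and a tag that never occurs in the category is assumed to contribute 0.

```python
from collections import defaultdict


def build_category_model(images):
    """Sum tag probabilities over all images of a category: l_i = sum of p_i."""
    model = defaultdict(float)
    for tags in images:
        for tag, probability in tags.items():
            model[tag] += probability
    return dict(model)


def likelihood_measurement(tags, model, n):
    """m = (sum of p_i * l_i) / n; tags absent from the model contribute 0."""
    return sum(p * model.get(tag, 0.0) for tag, p in tags.items()) / n


# Hypothetical recognition outputs (tag -> probability), one dict per image.
category = [
    {"steel": 0.9, "wheel": 0.8},
    {"steel": 0.7, "equipment": 0.6},
    {"steel": 0.8, "wheel": 0.5, "equipment": 0.4},
]
model = build_category_model(category)
n = len(category)

pulley = {"steel": 0.9, "wheel": 0.7}        # shares tags with the category
stub = {"illustration": 0.95, "text": 0.9}   # shares no tags with the category

print(likelihood_measurement(pulley, model, n))
print(likelihood_measurement(stub, model, n))
```

An image whose tags overlap with the category model gets a higher m than one whose tags never occur in it.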

Final choice of API

The likelihood measurement lets us test which API is better at separating fake images (marked red on the spreadsheets below) from proper ones within a single category.

(Note: images have different names on the two sheets because IBM Watson and Clarifai take their input differently: IBM Watson receives the image in the request body, while Clarifai fetches it by URL.)

Initially, IBM seemed to be the better option because its tags were closer to the subject. But being too specific in its descriptions turned out to be a disadvantage in our research model: spare parts other than idle pulleys, as recognized by IBM, had a lower likelihood than irrelevant images, while Clarifai would just mark both with something abstract like 'technology'. All three irrelevant images were sorted correctly when using Clarifai tags, so from this point on all measurements are done with the Clarifai API only.

Other measures

To underline the importance of category images being alike, a measure can increase the influence of the likelihood that each tag belongs to the category, for instance by raising the l_i term of the likelihood measurement to a power. In this research, small exponents like 1.5 or 1.7 were used to avoid excessively large values. The final empirical version of the likelihood measurement was m = (Σ p_i·l_i^1.5)/(1000n).
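The effect of the exponent can be illustrated with a short Python sketch. The function name, the sample model, and the image count are hypothetical, and the denominator is read here as 1000·n, which is an assumption about the intended grouping.

```python
def weighted_measurement(tags, model, n, exponent=1.5, scale=1000):
    """m = (sum of p_i * l_i ** exponent) / (scale * n).

    Assumes the denominator "1000n" means scale * n.
    """
    total = sum(p * model.get(tag, 0.0) ** exponent for tag, p in tags.items())
    return total / (scale * n)


# Hypothetical category model (tag -> summed probability) for n = 20 images.
model = {"steel": 16.0, "equipment": 9.0, "old": 1.0}
n = 20

common = {"steel": 0.5}  # tag that is frequent in the category
rare = {"old": 0.5}      # tag that is rare in the category

m_common = weighted_measurement(common, model, n)
m_rare = weighted_measurement(rare, model, n)
print(m_common, m_rare)

# With exponent 1 the ratio between the two would be 16; with 1.5 it grows to 64.
print(m_common / m_rare)
```

Raising l_i to a power above 1 widens the gap between tags that are common in the category and tags that are rare, which is the stated purpose of the exponent.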

Vector model

And now for something completely different.

The output of each individual recognition can be treated as a vector of dimension n, where n is the total number of distinct tags encountered within a category. If a tag is not present in an individual image's output, its component in the vector equals 0.
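A minimal Python sketch of this construction follows. The helper name and the sample data are hypothetical, and sorting the tags alphabetically is just one assumed way to fix the order of the vector components.

```python
def to_vector(tags, vocabulary):
    """Map a tag -> probability dict onto a fixed tag order; missing tags become 0."""
    return [tags.get(tag, 0.0) for tag in vocabulary]


# Hypothetical recognition outputs for two images of one category.
category = [
    {"steel": 0.9, "wheel": 0.8},
    {"steel": 0.7, "equipment": 0.6},
]

# All distinct tags met inside the category, in a fixed (alphabetical) order.
vocabulary = sorted({tag for tags in category for tag in tags})
vectors = [to_vector(tags, vocabulary) for tags in category]

print(vocabulary)
print(vectors)
```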

Now we can compute the vector distance between two given images; semantically, it tells how far apart the descriptions of the two images are. The most popular ways to calculate the distance between vectors are the Euclidean and Manhattan distances. In general, vector distances are described by the Minkowski formula: d(x, y) = (Σ|x_i − y_i|^p)^(1/p).

If p = 2, this distance is known as Euclidean; for p = 1 it's the Manhattan distance.
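For illustration, the Minkowski distance can be sketched in a few lines of Python. The function name and the sample vectors are hypothetical.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance: (sum of |x_i - y_i| ** p) ** (1 / p)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical tag vectors of two images over the same tag order.
x = [0.0, 0.9, 0.8]
y = [0.6, 0.7, 0.0]

manhattan = minkowski_distance(x, y, 1)  # p = 1
euclidean = minkowski_distance(x, y, 2)  # p = 2
print(manhattan, euclidean)
```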

If we sum up the distances from one image's vector to those of all other images in the category, we get another measurement: how far the description of the image stands from all other images in the category. The larger the distance, the more likely the image is irrelevant, whereas for the likelihood measurement it is the opposite.
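The summed distance can be sketched in Python as follows, here with the Manhattan distance. The function names and the sample vectors are hypothetical; the third vector is deliberately constructed as an outlier.

```python
def manhattan(x, y):
    """Manhattan distance between two vectors of equal dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))


def total_distances(vectors):
    """For each vector, sum its distances to every vector in the category.

    The distance of a vector to itself is 0, so including it changes nothing.
    Every pair is compared, so the cost is quadratic in the number of images.
    """
    return [sum(manhattan(v, other) for other in vectors) for v in vectors]


# Hypothetical tag vectors; the last one shares almost nothing with the others.
vectors = [
    [0.9, 0.8, 0.0],
    [0.8, 0.7, 0.0],
    [0.0, 0.1, 0.9],
]
totals = total_distances(vectors)

# Indices of images sorted from most to least suspicious.
ranked = sorted(range(len(vectors)), key=lambda i: totals[i], reverse=True)
print(totals, ranked)
```

The image with the largest total distance is the first candidate for a manual check.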

Let's see how different metrics sort images’ relevance.

We confirmed that the choice of model doesn't change the results seriously and doesn't affect the trend. In further research, the exponent 1.5 is used for the likelihood model and the Manhattan distance for the vector measure. For really huge amounts of data the likelihood model is more convenient, as it can be computed in O(n) time, while the vector distance requires O(n^2) time, where n here is the number of images.

Work on real data

The total number of images on the Boodmo site is around 10 million. Since recognizing all of these pictures via SaaS is expensive, we chose a few categories and worked within the limits of the Clarifai free plan, which includes 5000 requests. The first four sheets of the document below contain an evaluation of these categories. After computation, suspicious pictures were checked by hand and marked red if they were indeed irrelevant.


Results seen in the table are mixed.

The Panel/Tray and Locking categories didn't contain any irrelevant images.

The timing belt category shows our mathematical models succeeding: most of the bottom 1000 pictures indeed contained pieces of documentation that should be filtered out. There were actual spare parts among them as well, but these could be checked by hand.

The hand brake category is an example of a failed experiment: external image stubs, which in theory should be marked as irrelevant, sit in the middle of the table. The reason is that the hand brake category contains a large number of documentation pages that should be treated as relevant. The APIs recognize such pictures as 'illustrations', just like the stubs, and because of this it isn't possible to filter the stubs out with the chosen mathematical models.

Histogram analysis reveals that the data usually follows a Poisson distribution, which is seen slightly better in the vector distance model. The unusual bars at the edge for the successful timing belt category mark a large number of similar documentation pictures. These pictures could be cut off as irrelevant by choosing an appropriate threshold value that slices the data for consideration.

Unfortunately, most of the categories on the Boodmo site contain a lot of relevant documentation as well, and the successfully evaluated timing belt category, with its sharp difference between spare parts and faults, can be considered more like an exception than a trend.

Therefore, the described approach works, but only on a limited subset of the data.