Image Recognition Research: Choice of API

Research
 
19 September 2017

General problem

Project Boodmo contains a catalog of car spare parts. Sometimes these parts come with irrelevant images: pictures that are not car spare parts or anything similar. We needed a way to separate irrelevant images from normal ones.

 

Approach

Our first idea for tackling irrelevant images was to use a service that evaluates image content.

In most cases, such services follow the same workflow: they take an image as input, process it, and return a list of tags with suggested entities, each tag paired with a match probability. For instance, here is the output of IBM Watson Visual Recognition for an image of a disc brake:
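As a rough illustration, a response in this "tag with probability" shape can be parsed like so. The field names here are an assumption for the sketch, not the exact schema of any particular service:

```python
import json

# Hypothetical response in the generic "tag with probability" shape
# returned by such services; field names are illustrative only.
raw = """
{
  "tags": [
    {"name": "disk brake", "probability": 0.92},
    {"name": "brake", "probability": 0.88},
    {"name": "machine part", "probability": 0.64},
    {"name": "steel", "probability": 0.41}
  ]
}
"""

response = json.loads(raw)

# Sort tags by probability and keep only the most confident suggestions.
tags = sorted(response["tags"], key=lambda t: t["probability"], reverse=True)
top = [(t["name"], t["probability"]) for t in tags if t["probability"] >= 0.5]
print(top)  # → [('disk brake', 0.92), ('brake', 0.88), ('machine part', 0.64)]
```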

Such a list allows us to build mathematical models estimating image relevance: once all the images in a category have been processed, we can look for an image whose tags stand off from the others. The choice of mathematical model will be discussed later.
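One simple model of this kind (a sketch only, not the model eventually chosen) treats each image as a sparse vector of tag probabilities and flags the image least similar, on average, to the rest of its category:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse tag -> probability dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def least_typical(images):
    """Return the image whose tags stand off most from the others,
    i.e. the one with the lowest mean similarity to the rest."""
    scores = {
        name: sum(cosine(tags, other) for o, other in images.items() if o != name)
              / (len(images) - 1)
        for name, tags in images.items()
    }
    return min(scores, key=scores.get)

# Toy category: three brake-disc photos and one screenshot (all data made up).
category = {
    "disc1.jpg": {"disk brake": 0.9, "machine part": 0.6},
    "disc2.jpg": {"disk brake": 0.8, "steel": 0.5},
    "disc3.jpg": {"brake": 0.7, "disk brake": 0.85},
    "screenshot.png": {"web site": 0.9, "text": 0.8},
}
print(least_typical(category))  # → screenshot.png
```

The screenshot shares no tags with the brake discs, so its mean similarity is zero and it is flagged as the outlier.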

Image recognition services

Image recognition APIs are provided by most IT giants, such as Google, Amazon, IBM, and Microsoft, as well as by some independent organizations. In the first step of the research, we needed to determine which image recognition service fits our approach best.

Most of the APIs provide two kinds of services:

1. Evaluating an image against pre-trained models for specific classes of images (e.g., food, animals), including a general model. The general model was the one we used, since none of the APIs had a specific model for car spare parts.

2. Tools for training your own model - we didn't consider these in the research.

Method

The website of each SaaS provides a free demo where you can upload your own image and see how the service recognizes what's depicted. We used these demos with images from Boodmo, both relevant and irrelevant, and checked how well the resulting description matched the actual image content. So, at this point it didn't really matter whether the picture actually contained car spare parts.

The summary of this step of the research is a comparative table of services, shown below. It includes pricing information (how much processing a thousand pictures costs), the number of free requests to each API, and the results we got from evaluating 10 pictures from the site: 9 actual pictures from different categories of car spare parts and one irrelevant image (the one named Screenshot).

In the recognition result cells, we highlighted two kinds of tags from each result list: the most strongly suggested and the most accurate ones. Cell colors indicate a subjective mark of how accurate the recognition was: green - the picture is described correctly; yellow - there were some relevant tags; red - the image was described incorrectly.

Let's discuss each API separately.

Amazon Rekognition

Offering a broad variety of web services, Amazon couldn't pass on a topic as popular as image recognition. For this purpose, they hired the team behind the deep learning startup Orbeus and redesigned its product as Rekognition.

Of all the services considered, Amazon offers the cheapest plan ($1 per 1,000 images), but the recognition results were not satisfactory: tags were either too vague or simply incorrect. On several occasions the service even failed to return any response.

 

IBM Watson Visual Recognition

Just like its rivals, IBM also bought an image recognition startup, AlchemyAPI, and backed it with the resources of the Watson supercomputer. This was a much more successful experience: in most cases, Watson was able to provide a fairly concise description of what's in the image.

Clarifai

The independent team at Clarifai also built a system that correctly recognized most entities. Its descriptions were slightly broader than those supplied by IBM, but it still remained under consideration on par with Watson.

This API would later become one of the bases of the research.

Cloudsight

Another indie project, Cloudsight, is one of a kind: extremely expensive and slow, but in return it provides a string describing the image with magnificent accuracy. This kind of output doesn't fit the "tag with probability" idea of the current research, but if you ever face the challenge of textually describing a small number of pictures, this startup is highly recommended.

Imagga

Together with Clarifai, Imagga is a traditional subject of articles about independent recognition services. However, its results are much less impressive and in most cases simply incorrect.

Microsoft Cognitive Service

Jokes about the quality of Microsoft services no longer seem to be part of popular culture, but Cognitive Service makes you remember them. By design, the service provides tags and a description separately, and while the tags on rare occasions managed to produce at least something relevant to technology, the description section constantly amused us with crazy suggestions, including "a couple of giraffes that are next to a map", "a close up of a tooth brush", "a person on a surf board in a skate park", and so on. It seems that the general model of Microsoft Cognitive Service is simply undertrained and shouldn't be in production at all.

Metamind

Another independent API. Initially, as seen in the screenshot doc, it seemed to work quite well, but it failed with almost all of the test images we added.

{
  "probabilities": [
    {"label": "disk brake, disc brake", "probability": 0.4675085},
    {"label": "shield, buckler", "probability": 0.37788174},
    {"label": "gong, tam-tam", "probability": 0.046205986},
    {"label": "pickelhaube", "probability": 0.012509421},
    {"label": "magnetic compass", "probability": 0.012441072}
  ],
  "object": "predictresponse"
}

{
  "probabilities": [
    {"label": "web site, website, internet site, site", "probability": 0.31274727},
    {"label": "envelope", "probability": 0.28923094},
    {"label": "rule, ruler", "probability": 0.14897835},
    {"label": "notebook, notebook computer", "probability": 0.013616736},
    {"label": "analog clock", "probability": 0.013235992}
  ],
  "object": "predictresponse"
}
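Responses like these are easy to reduce to a single best guess. The snippet below parses a shortened excerpt of the first response above (two entries kept) and picks the label with the highest probability:

```python
import json

# Shortened excerpt of Metamind's first response above (two entries kept).
raw = ('{"probabilities":['
       '{"label":"disk brake, disc brake","probability":0.4675085},'
       '{"label":"shield, buckler","probability":0.37788174}],'
       '"object":"predictresponse"}')

resp = json.loads(raw)

# Pick the label the service is most confident about.
best = max(resp["probabilities"], key=lambda p: p["probability"])
print(best["label"])  # → disk brake, disc brake
```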

Google Vision

Of all the "tag with probability" systems, Google provides the most precise answers: far beyond simply determining the type of spare part, Google even suggests car models matching it. However, the way Google achieves such amazing results is dubious: evidently, Google Vision assigns image tags based on web search rather than on what it actually sees. As a result, the only irrelevant image was falsely described as a car, since it appeared on pages connected to cars, which makes Google Vision useless in the context of our research.

Summary

After this step of the research, two APIs were chosen for further development: IBM Watson Visual Recognition and Clarifai.