
Shutterstock – image search based on composition
Over the past 10 years, we have seen explosive growth in the visual content created and consumed on the web, but before the success of CNNs, images were found by performing simple keyword searches on manually assigned tags. All this changed around 2012, when A. Krizhevsky, I. Sutskever, and G. E. Hinton published their paper ImageNet Classification with Deep Convolutional Neural Networks. The paper described the architecture they used to win the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), a competition that is something like the Olympics of computer vision, where teams compete across a range of CV tasks such as classification, detection, and object localization. That was the first year a CNN took the top position, with a test error rate of 15.4% (the next best entry achieved a test error rate of 26.2%). Ever since then, CNNs have become the de facto approach for computer vision tasks, including visual search. The approach has been adopted by the likes of Google, Facebook, and Pinterest, making it easier than ever to find the right image.
Recently (October 2017), Shutterstock announced one of the more novel uses of CNNs, introducing the ability for users to search not only for multiple items in an image, but also for the composition of those items. The following screenshot shows an example search for a kitten and a computer, with the kitten to the left of the computer:

So what are CNNs? As previously mentioned, CNNs are a type of neural network well suited to visual content because of their ability to retain spatial information. They are somewhat similar to the previous example, where we explicitly defined a filter to extract localized features from the image. A CNN performs a similar operation, but unlike our previous example, the filters are not explicitly defined. They are learned through training, and they are not confined to a single layer but are instead built up across many layers. Each layer builds on the previous one, and each becomes increasingly abstract (abstract here means a higher-order representation, that is, from pixels to shapes) in what it represents.
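To make the filtering operation concrete, here is a minimal sketch in NumPy that applies a hand-crafted vertical-edge kernel, the kind of explicitly defined filter described above. The kernel values and toy image are illustrative assumptions; in a CNN, these kernel values would be learned from data rather than written by hand:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and compute a feature map
    (valid padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the patch by the kernel and sum the result
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted filter that responds to vertical edges (Sobel-like);
# a CNN would learn kernels like this during training instead.
vertical_edge = np.array([[1, 0, -1],
                          [2, 0, -2],
                          [1, 0, -1]])

# Toy image: dark on the left, bright on the right -> one strong vertical edge
image = np.zeros((8, 8))
image[:, 4:] = 1.0

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # large magnitudes in the columns where the edge sits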
To help illustrate this, the following diagram visualizes how a network might build up its understanding of a cat. The first layer's filters extract simple features, such as edges and corners. The next layer builds on these with its own filters, extracting higher-level concepts such as shapes or parts of the cat. These high-level concepts are then combined for classification purposes:

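To give a sense of how these layers stack in code, the following sketch uses Keras (an assumed framework here, chosen purely for illustration; the layer sizes and the 128 x 128 input are arbitrary choices) to chain convolutional layers so that each one operates on the feature maps of the layer before it:

```python
from tensorflow.keras import layers, models

# Each Conv2D layer learns its own bank of filters; layer N's filters
# operate on layer N-1's feature maps, not on raw pixels, which is what
# lets the network progress from edges to shapes to object parts.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu",
                  input_shape=(128, 128, 3)),      # edges, corners
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # simple shapes, textures
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),  # object parts (ears, whiskers)
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),         # e.g. cat vs. not-cat
])
model.summary()
```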
This ability to get a deeper understanding of the data and reduce the dependency on manual feature engineering has made deep neural networks one of the most popular ML algorithms over the past few years.
To train the model, we feed the network examples, using images as inputs and labels as the expected outputs. Given enough examples, the model builds an internal representation of each label that can then be used for classification; this, of course, is a type of supervised learning.
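In code, this supervised setup boils down to pairing inputs with expected outputs and letting an optimizer adjust the filters to reduce the prediction error. The sketch below reuses the hypothetical Keras `model` from the earlier sketch, and stands in random arrays for a real labeled dataset:

```python
import numpy as np

# Stand-in data: 100 hypothetical 128x128 RGB images with one-hot labels
# (in practice these would come from a labeled dataset such as ImageNet).
images = np.random.rand(100, 128, 128, 3).astype("float32")
labels = np.eye(2)[np.random.randint(0, 2, size=100)]  # one-hot: cat / not-cat

# The loss measures how far predictions are from the expected labels;
# backpropagation nudges every filter in the network to reduce that loss.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(images, labels, epochs=5, batch_size=16)
```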
Our last task is to find the location of the item or items; to achieve this, we can inspect the weights of the network to find out which pixels activated a particular class, and then create a bounding box around the inputs with the largest weights.
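One way this can be realized is with a class activation map: the last convolutional layer's feature maps are weighted by the classifier's weights for the class in question, producing a per-pixel map of how strongly each region activated that class. The sketch below (pure NumPy, with a randomly generated stand-in for such a map) shows the final step of turning that map into a bounding box:

```python
import numpy as np

def bounding_box(activation_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) covering every pixel whose
    activation exceeds `threshold` times the map's maximum."""
    mask = activation_map >= threshold * activation_map.max()
    ys, xs = np.where(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

# Stand-in activation map: a hot spot where the network "saw" the class
cam = np.zeros((14, 14))
cam[4:9, 6:12] = np.random.rand(5, 6) + 1.0  # strong activations in one region

print(bounding_box(cam))  # -> (6, 4, 11, 8) for this toy example
```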
We have now identified the items and their locations within the image. With this information, we can preprocess our repository of images and cache the results as metadata, making them accessible via search queries. We will revisit this idea later in the book, when you will get a chance to implement a version of it to help users find images in their photo album.
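As a rough sketch of what that preprocessing and caching might look like, the following code uses a hypothetical `detect_items` function as a stand-in for the classification and localization steps above, stores its output as per-image metadata, and answers both label queries and a simple composition query of the kind Shutterstock supports:

```python
def detect_items(image_path):
    """Hypothetical detector: returns (label, (x_min, y_min, x_max, y_max))
    pairs for one image; a real system would run the CNN here."""
    return [("kitten", (10, 40, 80, 120)), ("computer", (150, 30, 300, 200))]

def build_index(image_paths):
    """Run detection once per image and cache the results as metadata."""
    return {path: detect_items(path) for path in image_paths}

def search(index, wanted_labels):
    """Return every image whose cached metadata contains all wanted labels."""
    wanted = set(wanted_labels)
    return [path for path, items in index.items()
            if wanted <= {label for label, _ in items}]

def left_of(items, a, b):
    """True if item `a`'s box sits entirely to the left of item `b`'s box."""
    boxes = dict(items)
    return a in boxes and b in boxes and boxes[a][2] < boxes[b][0]

index = build_index(["album/photo1.jpg", "album/photo2.jpg"])
print(search(index, ["kitten", "computer"]))                      # both labels
print(left_of(index["album/photo1.jpg"], "kitten", "computer"))   # True
```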
In this section, we saw how ML can be used to improve the user experience, and we briefly introduced the intuition behind CNNs, a type of neural network well suited to visual contexts, where retaining the proximity of features and building higher levels of abstraction are important. In the next section, we will continue our exploration of ML applications by introducing another example that improves the user experience, along with a new type of neural network that is well suited to sequential data such as text.