This page is part II on generating inputs using deep learning models trained for image classification. For part I, follow this link; for part III, see here.
It is remarkable that deep learning models tasked with image classification are capable of producing coherent images representing a given output class. The task of image generation is quite different from classification, but nevertheless recognizable images may be generated by optimizing the output for a given class. In a previous section, we also saw that images may be generated with a linear combination of two classes, which allowed us to transform a generated image of a husky into an image of a dalmatian.
These observations lead to a natural idea: perhaps we can treat the output tensor of a deep learning model as a latent space. A latent space, also known as an embedding, is informally a space that captures meaningful information about the possible objects in that space. More precisely, it is a manifold in which object similarity correlates with a distance metric on that manifold.
We will consider the final 1000-dimensional fully connected layer activation, with one dimension per possible ImageNet category, as the output in question. At first glance it is not clear that this could be any kind of latent space: during training, each category is denoted by a one-hot vector in this space such that all possible categories are the same distance apart from each other. This means that there is no prior information encoded in one output class versus another, which is exactly what one wants when training for classification without prior knowledge.
On the other hand, we can view each class as a basis vector of this 1000-dimensional space and consider all the possible vectors that exist within it. A meaningful vector space of outputs would allow us to explore interesting questions by simply converting each question into a vector arithmetic operation.
On a memorable episode of the popular comedy 'Seinfeld', the character George decides to do the opposite of what he would normally do, with appropriately comedic results. But one might wonder: what is the opposite? For a number of ideas there seems to be a natural opposite (light and dark, open and closed), but for other ideas or objects it is more difficult to identify an opposite: for example, what is the opposite of a mountain? One might say a valley, but this is far from the only option. Likewise, objects like a tree and actions like walking do not have clear opposites.
In part I we saw that deep learning models are capable of forming an image that represents some target output class by performing gradient descent on the input $a$, guided by the model's output $O(a, \theta)$, where $\theta$ denotes the model's parameters.
Observe that even though only index 920 was optimized, other output class activations have been affected as well. It may be hypothesized that these activations correspond to a similarity between each of the 999 other ImageNet categories and class 920, with a higher activation signifying a more similar class, where 'similar' is a measure on the function $O(a, \theta)$ that the model has learned.
More formally, we want to find the index

$$i' = \underset{i}{\mathrm{arg\,min}} \; O(a, \theta)_i$$

where $O(a, \theta)_i$ denotes the activation of output index $i$ for the input $a$ generated to maximize the activation of the class of interest.
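As a sketch of this in code, assuming the `model` and the generated `input_tensor` from part I:

```python
import torch

with torch.no_grad():
    output = model(input_tensor)  # 1x1000 tensor of output activations

# the candidate 'opposite': the least-activated output index
opposite_index = int(torch.argmin(output[0]))
```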
The above method does not empirically yield very interesting results: the opposites of many ImageNet categories tend to be only a few classes, usually with no apparent relation to the category of interest. There is a clear theoretical basis for why this measure is not very effective: observe that there are many values that are near the minimum for the above image of a 'stoplight'. It is not clear therefore that the index of least activation is meaningfully distinct from the many other near-minimal indices, so the 'opposite' chosen this way is largely arbitrary.
Finding a meaningful opposite using our image-generating procedure applied to deep learning models will not be difficult if the output is indeed a latent space. We want to perform gradient descent on the input $a$ such that the activation of the output index $i$ of interest becomes as small as possible, corresponding to the loss $J = O(a, \theta)_i$, and the gradient we want is the gradient of this loss with respect to the input, which is

$$g = \nabla_a O(a, \theta)_i$$
The above formula can be implemented by simply assigning the loss to be the activation of the chosen output category, as minimization of a value is equivalent to maximization of its negative.
```python
def layer_gradient(model, input_tensor, desired_output):
    # compute the gradient of one output class activation with respect to the input
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = output[0][int(desired_output)]  # minimize the output class activation
    loss.backward()
    gradient = input_tensor.grad
    return gradient
```
and as before this gradient $g = \nabla_a O(a, \theta)_i$ is used to update the input at each step of gradient descent, $a_{n+1} = a_n - \epsilon g$ for some small step size $\epsilon$.
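A minimal sketch of such an update loop follows; the step size, iteration count, and starting input here are assumptions rather than exact values, and `model` and `device` are assumed to be defined as in part I:

```python
import torch

# start from a scaled normal random input, as described later on this page
input_tensor = (torch.randn(1, 3, 299, 299) / 25 + 0.5).to(device)

for _ in range(200):  # iteration count is an assumption
    gradient = layer_gradient(model, input_tensor, 920)
    with torch.no_grad():
        input_tensor -= 0.1 * gradient  # assumed step size
    input_tensor.grad = None  # reset so gradients do not accumulate
```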
In geometric terms, this procedure is equivalent to moving through the input space in whichever direction corresponds to moving as far as possible toward the negative axis of the output category's dimension in the output space.
At first consideration, this procedure might not seem likely to yield any meaningful input $a$.
Despite it being unlikely that any of the 1000 ImageNet categories would have only one opposite, we can find the category of the generated image as classified by our model of choice (GoogleNet) by finding which element of the model's output tensor $O(a, \theta)$ has the maximum activation.
```python
predicted = int(torch.argmax(output))
```
Now we can label each generated image according to which ImageNet category it most activates using a model of choice, here GoogleNet to be consistent with the image generation. The following video shows the generation of such an opposite input.
{% include youtube.html id='czayyaAi1cw' %}
Notice how a number of different categories have been maximized, and how the image appears to be a combination of different parts (an axolotl's gills with the feet and scales of a crocodile are perhaps the two most obvious). Some objects have more coherent, even reasonable opposites: toilet paper is soft, flat, and wavy whereas syringes are thin and pointy.
Dogs are perhaps the most interesting image category for this procedure: nearly every ImageNet dog class has a coherent opposite that is also a dog, and the opposites generated seem to be logically motivated: observe how the opposites for large, long-haired dogs with no visible ears are small, thin, and perky-eared breeds.
Likewise, the opposite of a dog with longer fur and a pointed face (border collie) is one with short fur and a squashed face (bloodhound), and the opposite of an image of a small dog with pointed ears (Ibizan hound) is a large dog with droopy ears (Tibetan Mastiff). Observe that opposites are rarely commutative: here we see a close but not quite commutative relation, where the opposite of an Ibizan is a Mastiff but the opposite of a Mastiff is a Terrier. In general, opposites are further from being commutative than in this example.
It is fascinating to see the generated images for the opposites of other animal classes.
The opposites of snakes are, curiously, usually lizards (including crocodiles) or amphibians (including axolotls), and the opposites of a number of birds are species of fish. Opposites to all ImageNet class images according to GoogleNet may be found by following this link.
In the last section, inputs representing the opposites of ImageNet classes were generated using gradient descent, with the gradient $g = \nabla_a O(a, \theta)_i$ used to minimize the activation of the ImageNet class $i$ in question.
We can view the difference between a Husky and a Dalmatian according to some deep learning model by observing what changes as our target class shifts from 'Husky' to 'Dalmatian', all using a picture of a dalmatian as an input. To do this we need to be able to gradually shift the target from the 'Husky' class (which is one basis vector of the output space, at index $h$) to the 'Dalmatian' class (another basis vector, at index $d$). This can be done by weighting the two activations in the loss, decreasing the weight on 'Husky' as the weight on 'Dalmatian' grows:

$$J = -\left( (1 - t)\, O(a, \theta)_h + t \, O(a, \theta)_d \right), \quad t \in [0, 1]$$
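A minimal sketch of this interpolated gradient, following the structure of `layer_gradient` above; the indices `h_index` and `d_index` and the schedule for $t$ are assumptions:

```python
def interpolated_gradient(model, input_tensor, h_index, d_index, t):
    # gradient of the weighted two-class loss with respect to the input;
    # the negative sign means both weighted activations are maximized
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = -((1 - t) * output[0][h_index] + t * output[0][d_index])
    loss.backward()
    return input_tensor.grad
```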
Using this method, we go from an image recognizable as a husky to one recognizable as a dalmatian.
{% include youtube.html id='1bdpG1caKMk' %}
Using InceptionV3 as our model for this experiment, we see that this is indeed the case: observe how the fluffy husky tail becomes thin, dark spots form on the fur, and the eye color darkens as $t$ increases.
{% include youtube.html id='PBssSJoLOhU' %}
We have so far seen that it is possible to generate recognizable images representing a given ImageNet class, its opposite, or a gradual shift from one class to another.
We can explore other vector operations. Vector addition is the process of adding the component vectors in a space, and may be thought of as resulting in a vector that contains some of the qualities of both operands. One way to perform vector addition during gradient descent on the input is to perform each update using the gradient of the summed activations of two target classes $i$ and $j$,

$$g = \nabla_a \left( O(a, \theta)_i + O(a, \theta)_j \right) \tag{1}$$
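A sketch of how (1) might be implemented, again following the structure of `layer_gradient` above (the sign convention assumes both activations are to be maximized):

```python
def addition_gradient(model, input_tensor, i, j):
    # gradient of the summed activations of classes i and j w.r.t. the input
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = -(output[0][i] + output[0][j])  # negative: maximize both activations
    loss.backward()
    return input_tensor.grad
```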
This update leads to the appearance of merged objects, for example this turtle and snowbird.
This sort of addition we can call a 'merging', as characteristics of both target classes are combined in a single object.
Some target classes tend to make recognizable shapes of one but not both categories when added together. This addition may instead be thought of as a 'juxtaposition' when distinct objects from each class appear separately in the image rather than combining into one.
However the addition is performed, there are instances in which the output is neither the merging nor juxtaposition of target class objects. For example, (1) applied to addition of a snowbird to a tarantula yields an indeterminate image somewhat resembling a black widow.
Suppose one wanted to understand which of the ImageNet categories are more or less similar to one another. For example, is an image of a cat more similar to a fox or a wolf? Specifically, we want this question answered with respect to abstract ideas like facial and tail structure, rather than some simple metric like color alone.
This question is not at all easy to address. We seek a metric that will determine how far each ImageNet category is from every other category, but the usual metrics one can place on an image will not be sufficient. Perhaps the simplest way to get such a metric is to take the average image for each category (by averaging the values of all images of one category pixel by pixel) and measure the distance between these average images.
Instead we want a measurement that corresponds to more abstract quantities, like the presence of eyes, number of legs, or roundness of an object in an image. We could use those three traits alone, and make a three-dimensional representation called an embedding consisting of points in space where the basis of the vector space is precisely the values attached to each of these characteristics. For example, a point where $[\text{eyes}, \text{legs}, \text{roundness}] = [4, 10, 0.2]$ would likely be some kind of insect, whereas the point $[-10, -2, 10]$ would most likely be an inanimate object like a beach ball.
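As a toy illustration of such an embedding, using the hypothetical trait values above, distance in this space then serves as a similarity measure:

```python
import torch

# toy three-dimensional embedding: [eyes, legs, roundness]
insect = torch.tensor([4., 10., 0.2])
beach_ball = torch.tensor([-10., -2., 10.])

# Euclidean distance between embedded points: larger means less similar
print(float(torch.dist(insect, beach_ball)))
```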
Happily for us, deep learning models are capable of observing high-level characteristics of an image. We have seen that feature maps of certain hidden layers of these models tend to be activated by distinctive patterns, meaning that we can use the total or average activation of a feature map as one of our basis vectors.
Somewhat arbitrarily, let's choose two features from GoogleNet's layer 5a as our basis vectors. For reference, here are the maps for the features of interest (meaning that the following images were found to maximally activate the features via gradient descent):
Feature 0 seems to respond to a brightly colored bird-like pattern, whereas feature 4 is maximally activated by something resembling a snake's head and scales. We can observe the activations of these features for GoogleNet-generated images representing each ImageNet class in order to get an idea of which categories these features score as more or less similar to each other. The following code plots the embedding performed by these features, using the average activation of each feature for every generated output.
```python
import torch
import matplotlib.pyplot as plt

def plot_embedding():
    # `images` is a list of (tensor, label) pairs of generated inputs, and
    # `network` returns layer 5a feature maps; both are defined elsewhere
    x, y, labels_arr = [], [], []
    for image, label in images:
        image = image.reshape(1, 3, 299, 299).to(device)
        output = network(image)
        x.append(float(torch.mean(output[0, 0, :, :])))  # mean feature 0 activation
        y.append(float(torch.mean(output[0, 4, :, :])))  # mean feature 4 activation
        # trim the label string to the category name
        j = 11
        while label[j] not in ',.':
            j += 1
        labels_arr.append(label[11:j])

    plt.figure(figsize=(18, 18))
    plt.scatter(x, y)
    for i, label in enumerate(labels_arr):
        plt.annotate(label, (x[i], y[i]))
    plt.show()
    plt.close()
    return
```
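The `network` above must expose layer 5a activations rather than class predictions; one way to construct such a truncated model (an assumption about the setup, not necessarily the construction used here) is to chain GoogleNet's modules up through `inception5a`:

```python
import torch
import torchvision

device = 'cuda' if torch.cuda.is_available() else 'cpu'
base = torchvision.models.googlenet(weights='DEFAULT').to(device).eval()

# reproduce GoogleNet's forward pass up to and including inception5a
network = torch.nn.Sequential(
    base.conv1, base.maxpool1, base.conv2, base.conv3, base.maxpool2,
    base.inception3a, base.inception3b, base.maxpool3,
    base.inception4a, base.inception4b, base.inception4c,
    base.inception4d, base.inception4e, base.maxpool4,
    base.inception5a,
)
```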
Running `plot_embedding()` yields
which has a noticeably skewed distribution. It appears that Feature 0 corresponds to a measure of something similar to 'brightly-colored bird', whereas Feature 4 is less clear but is most activated by ImageNet categories that are man-made objects.
Investigating which ImageNet categories are more or less similar to each other was explored in the previous section using two features from one layer of a chosen model (GoogleNet). But in one sense these embeddings are of limited use, because they represent only a very small portion of the information that the model possesses with respect to the input images, as there are many more features in that layer and many more layers in the model. To be specific, the embedding diagram in the last section indicates that 'Jay' is the ImageNet class most similar to 'Indigo Bunting' for GoogleNet, but only according to two out of over 700 features of one specific layer.
Each of the features and layers contributes to the final classification prediction, and moreover these layers and features are formed by non-linear functions such that their effects are non-additive. Therefore, although embedding the output categories using feature activations as the basis space is somewhat useful, it is by no means comprehensive. Another approach may be in order, one in which the entire model is used rather than a few features.
There does exist a straightforward way to determine which ImageNet categories are more or less similar to each other: we can simply take the model's output vector (with ImageNet classification occurring in the last layer) given the generated input $a$ for each category, and observe which other class is most activated by that input.
There exists a problem with using this approach as a true similarity metric, however: the measurement is not symmetric, as the class most activated by an input generated for category $i$ is generally not the class whose own generated input most activates category $i$. Because pairs of points exhibit an asymmetric measurement, we cannot portray this as a metric space. But it is possible to portray these points as an abstract graph, with nodes corresponding to ImageNet categories (ie outputs) and edges corresponding to relationships between them. We will start by only plotting the 'nearest neighbor' relationship, which is defined as the output that is most activated by the generated image, distinct from the target output itself.
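The adjacency information can be assembled from the generated images first; a sketch of this step follows, where `images`, `model`, `device`, and a list of `class_names` are assumed to be defined as before, with `images[k]` generated for class `k`:

```python
import torch

closest_array = []
with torch.no_grad():
    for target_index, (image, label) in enumerate(images):
        output = model(image.reshape(1, 3, 299, 299).to(device))[0]
        output[target_index] = -float('inf')  # exclude the target class itself
        neighbor = class_names[int(torch.argmax(output))]
        closest_array.append((label, neighbor))
```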
```python
import networkx as nx
import matplotlib.pyplot as plt

def graph_nearest():
    # convert the array of (category, nearest neighbor) pairs to an adjacency dict
    closest_dict = {}
    for pair in closest_array:
        if pair[0] in closest_dict:
            closest_dict[pair[0]].append(pair[1])
        else:
            closest_dict[pair[0]] = [pair[1]]

    # build and draw the nearest-neighbor graph
    G = nx.Graph(closest_dict)
    nx.draw_networkx(G, with_labels=True, font_size=12, node_size=200, node_color='skyblue')
    plt.show()
    plt.close()
    return G
```
The first half of the 1000 ImageNet categories are mostly animals, and plotting a graph for them yields
Nodes that are connected together form a 'component' of the graph, and nodes that are all connected to each other form a complete component called a 'clique'. Cliques of more than two nodes are extremely rare for ImageNet nearest neighbors, but non-trivial components (ie those with more than two nodes) abound, often with very interesting and logical structures. Observe how cats form one component, terriers reside in another, and mustelids and small mammals in another.
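Components and cliques can be enumerated directly with networkx; a brief sketch, using the graph returned by `graph_nearest()` above:

```python
G = graph_nearest()

components = list(nx.connected_components(G))            # sets of connected nodes
cliques = [c for c in nx.find_cliques(G) if len(c) > 2]  # maximal cliques of size > 2
print(len(components), cliques)
```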
For the non-animal half of ImageNet, graphing nearest neighbors yields a component with more members than was observed for any animal component.
This component contains many diverse objects, yet its members often still exhibit relationships that seem reasonable to a human.
Smaller components are often the most illuminating: observe the sports balls clustering together in one component, and the utensils in another component
The above measurement is illuminating, but it does not form a true metric space. Instead of finding the second-largest output for our generated inputs, we can find the ImageNet class corresponding to the second-closest point (including the point of interest) to our point in the output space. This means that we wish to perform an embedding of the output with respect to the model. The reasoning here is that our trained GoogleNet model with parameters $\theta$ maps any input $a$ to a point $O(a, \theta)$ in the 1000-dimensional output space, and the distances between these points may be taken as a measure of similarity.
Now if we use real images from some category, there is no guarantee that the resulting outputs will be consistent from one image to the next, so instead we can use the generated image for each category as that category's representative input.
Using these representative inputs, we obtain one point in the output space for each ImageNet category, and can compare distances between these points directly.
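A sketch of this symmetric nearest-neighbor computation, assuming `outputs` is an $(n, 1000)$ tensor holding one output vector per category's generated input:

```python
import torch

distances = torch.cdist(outputs, outputs)  # pairwise Euclidean distances
distances.fill_diagonal_(float('inf'))     # ignore each point's distance to itself
nearest = torch.argmin(distances, dim=1)   # index of each category's nearest neighbor
```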
As spaces with more than two or three dimensions are hard to visualize, we can perform a dimensionality reduction for visualization; here we will find a function mapping each 1000-dimensional output point to a point in the plane, using principal component analysis (PCA).
When we find the coordinates of each category's output along the first two principal components and plot them, the result is somewhat underwhelming: principal components 1 and 2 account for only a small fraction of the total variance in the outputs.
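For reference, a minimal sketch of this projection, with the `outputs` tensor assumed as before:

```python
from sklearn.decomposition import PCA

# project the 1000-dimensional output points onto their first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(outputs.cpu().numpy())
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```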
Why is this the case? The failure lies in PCA's expectation of a linear space, one in which transformations such as vector addition and scalar multiplication behave linearly, and where in particular the intuitive (Euclidean) metric of distance holds. As the points in this space were generated using a nonlinear function (gradient descent of GoogleNet on a scaled normal input), there is no reason to think that a linear decomposition would be capable of capturing much of the variance in that function.