In part I, it was observed that different models generate slightly different representations of each ImageNet input class. It may be wondered whether we can use the gradients of two models to generate an input, perhaps by combining the gradient of a model with one set of parameters with the gradient of a model with a different set of parameters. One way to do so is to add the two models' contributions in the loss used by the gradient function:
def layer_gradient(model, input_tensor, desired_output):
    ...
    loss = 0.5*((200 - output[0][int(desired_output)]) + (400 - output2[0][int(desired_output)]))
    ...
where output is the (logit) output of ResNet and output2 is the output from GoogleNet.
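For context, here is a minimal sketch of how such a combined gradient might be applied to generate an input, assuming that layer_gradient returns the gradient of this loss with respect to the input and using illustrative values for the step size and iteration count:

def generate_input(model, target_class, iterations=100, step_size=1.0):
    # start from random noise and repeatedly step the input against the
    # combined gradient so that both models' logits for target_class increase
    input_tensor = torch.randn(1, 3, 299, 299).to(device)
    for _ in range(iterations):
        gradient = layer_gradient(model, input_tensor, target_class)
        input_tensor = (input_tensor - step_size*gradient).detach()
    return input_tensor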
Images of all 1000 ImageNet classes generated using the combined gradient of GoogleNet and ResNet are available here. From these images it is clear that the combined gradient is as good as or superior to the gradients from only ResNet or GoogleNet with respect to producing a coherent input image, which suggests that the gradients from these models are not substantially dissimilar.
The above observation motivates the following question: can we understand the differences between models by generating an image representing the difference between the two models' output activations for a given class? This may be done by taking the difference rather than the sum of the two terms in the loss, again with output and output2 signifying the logit outputs of ResNet and GoogleNet, respectively:
def layer_gradient(model, input_tensor, desired_output):
    ...
    loss = 2*((200 - output[0][int(desired_output)]) - (400 - output2[0][int(desired_output)]))
    ...
Which yields
Note that the roll-over bars present in the depiction of go-karts by ResNet50 are absent from GoogleNet's representation of the same class, and that consequently the representation generated from the difference in outputs emphasizes these bars. For other ImageNet classes, however, there does not tend to be a substantial difference between inputs generated for each class via the difference gradient and those generated from either model alone.
It is also apparent that the similarities and differences in model output may be compared by viewing the output as a vector space. Say two models give very similar outputs for a representation of one ImageNet class but different outputs for another class; the identities of those classes may help inform an understanding of the difference between the models.
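As a rough illustration, the distance between two models' outputs in this vector space could be measured directly, for instance with a cosine similarity (the function below is a sketch under this assumption, not code from the original):

def output_similarity(model_a, model_b, image):
    # treat each model's logit output as a point in a 1000-dimensional vector space
    with torch.no_grad():
        out_a = model_a(image).flatten()
        out_b = model_b(image).flatten()
    # cosine similarity between the two output vectors (1 = identical direction)
    return torch.nn.functional.cosine_similarity(out_a, out_b, dim=0)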
In the last section it was observed that we can understand some of the similarities and differences between models by viewing the output as a vector space, with each model's output on each ImageNet representation being a point in this space.
What if we want one model to generate a representation of an ImageNet class that is similar to another model's representation? We have already seen that some models (GoogleNet and ResNet) generally yield recognizable input representations whereas others (InceptionV3) yield somewhat less-recognizable inputs. But if we were stuck with only using InceptionV3 as our image-generation source, could we use some of the information present in the other models to generate a more-recognizable image?
One may hypothesize that it could be possible to train one model to become 'more like' another model by using the output of the second as the target for the output of the first. Consider one standard method of training: maximum likelihood estimation via minimization of the cross-entropy between the true output and the model's predicted output. The true output here is a one-hot vector, in which the correct class is assigned a value of one and every other class a value of zero.
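For reference, a minimal sketch of this standard objective in PyTorch (the function and variable names here are illustrative rather than taken from the original code):

import torch
import torch.nn.functional as F

def mle_step(model, input_tensor, class_index, optimizer):
    # standard maximum likelihood training: the target is a single class index,
    # equivalent to a one-hot probability distribution over the 1000 classes
    output = model(input_tensor)                              # logits of shape (1, 1000)
    target = torch.tensor([class_index]).to(input_tensor.device)  # true class label
    loss = F.cross_entropy(output, target)                    # cross-entropy to the one-hot target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()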
But there are indications that this is not an optimal training loss. Earlier on this page we have seen that for trained models, some ImageNet categories are more similar to each other with respect to the output activations than others. Training toward one-hot targets requires the separation of the output distributions of different classes, even those that are quite similar to one another, and thus discards this similarity information. These observations motivate the hypothesis that we may be able to use the information present in the output of one model as the training target for another model, in place of a one-hot vector,
which can be implemented as
def train(model, input_tensor, target_output, optimizer):
    ...
    output = model(input_tensor)
    loss = torch.sum(torch.abs(output - target_output)**2) # sum-of-squares loss
    optimizer.zero_grad() # prevents gradients from adding between minibatches
    loss.backward()
    optimizer.step()
    return
Note that more commonly-used metrics like cross-entropy are less natural here, as the target is another model's logit output rather than a one-hot label. The procedure of matching ResNet's output to GoogleNet's output on a given image may be implemented as
def gradient_descent():
    optimizer = torch.optim.SGD(resnet.parameters(), lr=0.00001)
    # access the image generated by GoogleNet (the for/break pattern takes only the first element of images)
    for i, image in enumerate(images):
        break
    image = image[0].reshape(1, 3, 299, 299).to(device)
    # GoogleNet's output on its own generated image becomes ResNet's training target
    target_output = googlenet(image).detach().to(device)
    target_tensor = resnet(image) - target_output
    for _ in range(1000):
        train(resnet, image, target_output, optimizer)
Gradient descent is fairly effective at reducing a chosen metric between ResNet's and GoogleNet's outputs on this image, but ResNet's subsequent representation of the class does not become noticeably more like GoogleNet's. It may be hypothesized that this is because although ResNet's outputs match GoogleNet's for this class, each class has a different 'meaning', i.e. latent space location, for each model, which would undoubtedly hinder our efforts here. But even if we repeat this procedure to train ResNet's outputs to match GoogleNet's (on that model's representations) for all 1000 ImageNet classes, we still do not get an accurate representation of Class 0 (or any other class of interest).
It is certainly possible that this method would be much more successful if applied to natural images rather than generated representations. But this would go somewhat against the spirit of the hypothesis, because natural images would bring with them new information that may not exist in the models themselves.
These results suggest that modifying a model's parameters so that its outputs match another model's is not sufficient to transfer the second model's input representations. We therefore take a more direct approach to making one model yield another model's representations: rather than modifying the first model's parameters, we leave them fixed and instead generate an input whose output under the first model matches an appropriate target vector.
For clarity, this procedure starts with an image that is the representation of a certain class with respect to some model (here GoogleNet). In words, the target vector is the output of the model of interest (here ResNet) when that image is supplied as its input.
Now, rather than performing gradient descent on the input using a single target class logit as before, we minimize the distance between the model's full output vector and this target vector, with the input modification being a modified version of gradient descent using smoothness (i.e. pixel cross-correlation) via Gaussian convolution, and where the initial input is random noise. The gradient on the input is computed as
def layer_gradient(model, input_tensor, target_tensor):
    ...
    input_tensor.requires_grad = True
    output = model(input_tensor)
    # L1 distance between the fixed target vector and the model's current output
    loss = 0.05*torch.sum(torch.abs(target_tensor - output))
    loss.backward()
    gradient = input_tensor.grad
    return gradient
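The outer loop that applies this gradient is not shown above; a minimal sketch is given below, assuming the layer_gradient function just defined and illustrative values for the step size, iteration count, and blur strength (the Gaussian blur supplies the smoothness mentioned above):

import torchvision.transforms.functional as TF

def generate_matching_input(input_tensor, target_tensor, iterations=500, step_size=1.0, sigma=1.0):
    # repeatedly step the input against the gradient of the distance to the
    # target vector, then blur slightly to encourage pixel cross-correlation
    for _ in range(iterations):
        gradient = layer_gradient(resnet, input_tensor, target_tensor)
        input_tensor = (input_tensor - step_size*gradient).detach()
        input_tensor = TF.gaussian_blur(input_tensor, kernel_size=5, sigma=sigma)
    return input_tensor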
If this method is successful, it would suggest that our model of interest (ResNet) retains enough information in its output alone to specify the other model's representation of a class: finding the right point in output space is enough to recover an input resembling that representation.
The ability of the right point in output space to mimic the representation of another model (for some given class) is even more dramatic when the two models' representations of that class are noticeably different. For example, observe the representation of 'Class 11: Goldfinch' by ResNet and GoogleNet in the images on the left and center below. ResNet (more accurately) portrays this class using a yellow-and-black color scheme with a dark face, whereas GoogleNet's portrayal has reddish feathers and no dark face. But if we apply the above procedure to ResNet, it too mimics the GoogleNet output.
Likewise, ResNet's depiction of a transverse flute contains flute players in addition to the instrument itself, whereas GoogleNet's depiction does not. When we vectorize ResNet's output to match that of GoogleNet, we see that ResNet's depiction of a transverse flute no longer contains the players.
It may be wondered whether we can apply our gradient descent method to form representations of natural images, rather than of generated ones representing some ImageNet category. All that is required is to change the target vector: instead of taking the model's output on a generated class representation, we take its output on a natural image of interest.
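For illustration, a sketch of how such a target might be formed from a natural image, following the 299x299 input size used above (the file name and preprocessing here are assumptions, not from the original code):

from PIL import Image
import torchvision.transforms as transforms

# hypothetical natural image chosen as the target source
image = Image.open('dalmatian.jpg')
preprocess = transforms.Compose([transforms.Resize((299, 299)), transforms.ToTensor()])
input_image = preprocess(image).reshape(1, 3, 299, 299).to(device)
# the target vector is now the model's output on a natural image
target_tensor = resnet(input_image).detach()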
When we choose two images of Dalmatians as our targets, we see that the representations are indeed accurate and that the features they portray are significant: observe how the top image's representation focuses on the body and legs (which are present in the input) whereas the bottom focuses on the face (body and legs not being present in the input).