Review #3 #52
Thank you so much for your thorough review! We greatly appreciate that you took the time to engage with our work as deeply as you clearly did. In Distill publications we strive to educate, and we believe your feedback was instrumental in helping us do that better. We hope to honor your effort by responding in kind, and by submitting a revised version that addresses many of your suggestions. Let’s address them bit by bit:
We agree that showing “how to do it” is a worthy goal. In this vein, we have added colab notebooks with more tutorial-like explanations of the implementations for each of the sections in our article. These also contain additional text explaining the implementation details.
We believe that the added notebooks, which can be launched right from the browser, will allow trying out these methods very quickly. Additionally, we added a new subsection to the introduction, addressing the big-picture ideas of how parameterizations affect the optimization results. We also expanded on implementation details in footnotes throughout the article where we thought it was helpful for the interested reader to do so. Still, we empathize with the wish for more details for each example. We expanded our explanations in many sections, and we added more step-by-step explanations in the linked notebooks. If our response still seems hesitant, it’s only in the sense that we believe that in many of the applications shown in our paper the details of the described techniques are not at the heart of what we want to communicate. We will expand on this point in the replies to “Feedback and questions on each section”.

Aligned Feature Visualization Interpolation

We rewrote this section entirely. We link to the specifics of the feature visualization task itself, and describe its peculiarities in the context of our work to require less prior knowledge. This also helped sharpen the focus on the technique that we do want to explain: shared parameterizations.
We now link to the Feature Visualization paper that explains the underlying optimization objective in more detail.
For this specific application, aligning frames in an interpolation, we chose a low-resolution parameterization. The specifics of how to choose a shared parameterization will depend on the intended improvement.
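For concreteness, here is a minimal numpy sketch of such a shared parameterization. The tensor sizes, the nearest-neighbour upsampling, and the sigmoid squashing are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample(p, factor):
    # Nearest-neighbour upsampling of an (h, w, c) array.
    return p.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical sizes: a 4x4 shared tensor upsampled to 16x16,
# plus one 16x16 unique tensor per interpolation frame.
rng = np.random.default_rng(0)
p_shared = rng.normal(size=(4, 4, 3))       # low-resolution, shared
p_unique = [rng.normal(size=(16, 16, 3))    # high-resolution, per frame
            for _ in range(5)]

# Each frame decodes as sigmoid(upsample(P_shared) + P_unique_i); gradients
# flowing through the shared tensor couple the frames, which aligns them.
frames = [sigmoid(upsample(p_shared, 4) + pu) for pu in p_unique]
```

Optimizing all frames jointly updates the single shared tensor as well as each frame’s unique tensor, which is what encourages the frames to stay visually aligned.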
Agreed, we replaced N by the sigmoid function.

Style Transfer
Our notebook implementation of this approach lists the layers used and allows trying different configurations.
The loss is still computed in the feature space of the pretrained DNN, which, in turn, takes a pixel-space representation of the image as input. We now explicitly call out that we only change the parameterization of the optimized image, and that the original network and loss function are not changed in any way.
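To make this concrete, here is a minimal numpy sketch of the idea: the optimized variable lives in Fourier space, and only its decoded pixel-space image is handed to the unchanged network and loss. The frequency scaling and sigmoid squashing here are illustrative assumptions, not our exact code:

```python
import numpy as np

h, w = 32, 32
rng = np.random.default_rng(0)
# The optimized parameters: real and imaginary parts of a half-spectrum.
spectrum = rng.normal(size=(h, w // 2 + 1, 2))

# Illustrative frequency scaling that boosts low frequencies.
freqs = np.sqrt(np.fft.fftfreq(h)[:, None] ** 2 +
                np.fft.rfftfreq(w)[None, :] ** 2)
scale = 1.0 / np.maximum(freqs, 1.0 / max(h, w))

def to_image(spectrum):
    # Decode Fourier-space parameters into a pixel-space image.
    complex_spec = (spectrum[..., 0] + 1j * spectrum[..., 1]) * scale
    img = np.fft.irfft2(complex_spec, s=(h, w))
    return 1.0 / (1.0 + np.exp(-img))  # squash to the (0, 1) pixel range

image = to_image(spectrum)  # this image goes into the unchanged loss
```

Only `to_image` changes between parameterizations; everything downstream of the decoded image stays exactly as in standard style transfer.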
That is good epistemic hygiene! We agree that we have not proven causality and weaken our claims accordingly (LINK). We now only claim that this specific style transfer setup works with our improved parameterization and not with a default pixel-space parameterization.
We now link to the paper introducing and visualizing these artefacts. At the same time we weaken the assertion to a suspicion; this is an area of ongoing research for us.
We refer the reader to our previous "feature visualization" article, which explores the effect of switching to the weighted Fourier parameterization for image-space gradient optimization.

Transparent neuron visualizations

We now explicitly write out how the optimization objective changes to include alpha transparency.

CPPN example
Yes, the architecture remains fixed.
We now provide a full implementation in a notebook. It also shows our choice of activation functions, and the results of choosing different ones.
We now explicitly include a calculation of the number of parameters in the associated notebook. We only use on the order of thousands of parameters.
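As a rough illustration of why the parameter count stays small, here is a minimal CPPN sketch in numpy. The layer widths and activation functions are hypothetical, not the article's architecture: a small fixed MLP maps each pixel's (x, y) coordinate to an RGB value, so the number of parameters depends only on the layer widths, not on the image resolution:

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [2, 16, 16, 3]  # hypothetical layer sizes: (x, y) in, RGB out
params = [(rng.normal(size=(a, b)), np.zeros(b))
          for a, b in zip(widths[:-1], widths[1:])]

def cppn(xy, params):
    h = xy
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.tanh(h)           # the activation choice shapes the result
    return 1.0 / (1.0 + np.exp(-h))  # RGB values in (0, 1)

# Render at any resolution from the same small set of parameters.
n = 64
xs, ys = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
coords = np.stack([xs.ravel(), ys.ravel()], axis=-1)
image = cppn(coords, params).reshape(n, n, 3)

n_params = sum(w.size + b.size for w, b in params)  # a few hundred here
```

Because the image is sampled from a continuous coordinate function, the same parameters can be re-rendered at arbitrary resolution.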
We have not attempted to measure the influence of these choices explicitly, but hope that the added notebook will encourage interested readers to explore this space.

3D Feature Visualization
Your understanding is correct: the loss is calculated on the rendered 2D image.

3D Style Transfer
We’re sorry you saw the article at a time when the rendered example did not work. We have rewritten that diagram’s code to be more resource-efficient and have tested the new version across more browsers.
We agree that the sentence was suboptimal and have replaced this section entirely. For implementors, we additionally provide our reference implementation.
The following peer review was solicited as part of the Distill review process. The review was formatted by the editor to help with readability.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to write such a thorough review.
Overall:
I love the subject material in the article. I wish it educated me more.
Currently the article advocates a viewpoint: that image generation algorithms should often work in non-pixel spaces. However, the article feels like it would be stronger and more useful if it were written from the point of view of teaching me how to do it, rather than just convincing me that it should be done.
In particular, most of the examples in the article omit key details that I would want to understand if I were to want to try to apply the ideas. In general, the simpler the example, the more explicit I wish the details were, because then I could try them out more quickly.
I think the article would be better if, for each algorithm, it:
Even though this might add a few formulas, I suspect that with the right notation, it would actually make the article more readable.
Feedback and questions on each section:
(1) The aligned neuron visualization example describes the parameterization as N(P[shared] + P[unique]), where N is the sigmoid and P[unique] is "high resolution".
A few extra details might make it much easier to understand what is happening:
(2) On style transfer, it is asserted that optimizing the learned image in Fourier space yields better-looking results on non-VGG architectures, but again it would be easier to read if you were more explicit on exactly how the process is different. Here the puzzle is how the loss is affected.
(3) On transparent neuron visualizations.
The simplicity of this example+idea is really nice.
(4) CPPN example.
I like the CPPN visualizations, but it left me with a number of questions and unsure how to achieve good results with CPPNs.
(5) Bunny examples #1
(6) Bunny example #2