Responses to Reviews
We would like to thank all three reviewers for their excellent feedback. We’ve identified a few common themes in the feedback, which we address together before turning to specific responses.
Epistemic Concerns
The task of understanding what a neuron means is an inherently subjective one, and the judgements made in this article all reflect the lenses and biases of the authors’ worldview, as well as the tools available for understanding the neuron’s role in the neural network.
We attempt, however, to minimize the degree of subjectivity as much as possible by:
- Probing each neuron in a number of distinct ways. In this article, these are feature visualization, dataset examples, and text feature visualization.
- Soliciting feedback on what these neurons mean from people from a wide variety of backgrounds.
- Giving the reader access, to the extent possible, to the data and tools used to construct these visualizations.
While we stand by the broad modes of organization described in the article, we do not intend for the labels on each individual neuron to be considered ground truth. We welcome and encourage open debate on what each individual neuron might mean.
Cherry-picking of Facets
Often within the article we have chosen to feature the facet visualization that we believe best suits the neuron. This choice was made for communication purposes, but it can leave the reader unsure why a particular facet was chosen. For full transparency, we’ve updated the article so that when such a choice is made, the other facets are also selectable.
Reproducibility
Reviewing this paper presents particular challenges due to its dependence on CLIP and on model disclosure. The weights for CLIP, however, are now released, and we will include a TensorFlow port of CLIP RN50_4x, the model used in the paper, along with code for doing facet visualizations.
CLIP Model details
Many reviewers have pointed out that the paper was missing details about the CLIP model. The simultaneous nature of this release has made the review process particularly challenging, but we refer the reader to Radford et al. for details on model training.
Specific Responses
Responses to Reviewer 1
The authors were looking for single interpretable neurons. It is probably worth mentioning the reference [11] (Network Dissection), which quantitatively showed that individual neurons are more interpretable than random dense projections (as hypothesized previously in [30]).
This has been added!
Still, I wonder how interpretable the sparse random neural combinations can be. In the sections “The ImageNet Challenge” and “Emotion Composition” the authors design some of these combinations, but what would happen if they were sampled randomly (say 2-3 neurons)?
We’ve added a footnote that explores dataset examples and feature visualizations for neurons sampled randomly.
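For readers who want to poke at this themselves, here is a minimal sketch of the kind of probing the footnote describes, assuming you already have a matrix of spatially pooled activations. The function and array names are ours, for illustration only, and this is not the code used to produce the footnote.

```python
# Minimal sketch: rank dataset images by a random sparse combination of 2-3 neurons.
# `acts` is assumed to be a pre-computed [num_images, num_neurons] array of
# spatially pooled activations (placeholder data below).
import numpy as np

def random_sparse_direction(num_neurons, k=3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(num_neurons, size=k, replace=False)  # pick k random neurons
    direction = np.zeros(num_neurons)
    direction[idx] = rng.standard_normal(k)               # random signed weights
    return direction, idx

def top_dataset_examples(activations, direction, n=9):
    scores = activations @ direction                       # projection onto the sparse direction
    return np.argsort(-scores)[:n]                         # indices of the most-activating images

acts = np.random.rand(10000, 2560)                         # placeholder activations
d, neurons = random_sparse_direction(acts.shape[1], k=3)
print("neurons:", neurons, "top images:", top_dataset_examples(acts, d))
```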
Labelling of some neurons across all modalities is much less convincing than for others. For example “kids art” label is well supported by feature vis, dataset samples and text strings, while “egyptian art” only shows up on the feature channel visualization, while the neuron visualization and many dataset samples show a loaf of bread. This brings up the question of how accurate the visualization is. In particular, how does the activation value of the synthetic vis compare to that of the real images?
In part, the answer to this is that not all training set images can be included in Microscope, and the most strongly-activating dataset images in Microscope might not be representative. However, we do observe this sort of discrepancy sometimes, especially for polysemantic neurons. Upon further inspection, the “egyptian art” neuron appears more polysemantic than it initially seemed and has been removed from the “Art Style Neurons” section.
The CLIP model is not the first model that operates on a joint image-text modality. It may be worth mentioning the work on automatic image captioning with generative text models, the vision part of which might display similar multimodal properties (briefly mentioned in footnote 4).
Our goal in this paper is to keep details of the CLIP model sparse and leave most of that discussion to the CLIP paper itself.
Why three datasets (twitter{1,2,3}) in microscope?
This is a typo and has been fixed.
“Looking to neuroscience, they might sound like “grandmother neurons,” -- I’d like to see a more detailed explanation of the “grandmother neurons” concept here.
The idea of a grandmother neuron has been highly debated in neuroscience. We’ve added several citations that discuss this issue:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822296/
http://www.cs.utexas.edu/~dana/quiroga08.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3032396/
Footnotes are unaligned, i.e. footnote “29” in the paper shows up as “30” in the footnote list in the end.
Fixed!
Fig 6. Really awesome. But where are the US cities?
CLIP has more specific neurons for parts of the US (including an NYC neuron, a west coast neuron, and an east coast neuron). For city names, we generally see the specific neuron fire rather than the general one. This may be a more general pattern for neurons with this kind of hierarchy: when an image implicates a more precise neuron, the general one tends not to fire.
“(this is a terrible description, please change)” - in Microscope
Fixed.
What is the ImageNet performance of the model in the case of dense linear probes?
“The two sides meet at the end, going through some processing and .... If we ignore spatial structure.” The procedure seems a bit similar to this and other works on weighted pooling. Footnote 34 may need a bit more explanation. For example, it should be made clear that averaging happens over the spatial dimensions. Where do those weight matrices come from? Are they part of the CLIP model?
Added a bit more explanation plus a citation on global weighted average pooling. The weight matrices are part of CLIP.
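To make the pooling step concrete, here is a rough sketch of global weighted average pooling as described in the footnote: average over the spatial dimensions, then apply learned weight matrices (which are part of the model). The shapes and variable names below are illustrative placeholders, not CLIP’s actual parameters.

```python
# Rough sketch of global weighted average pooling: spatial positions are
# averaged, then projected by learned weight matrices into the embedding space.
# All shapes/names are illustrative, not taken from the CLIP code.
import numpy as np

def global_weighted_average_pool(features, w_value, w_proj):
    """features: [H, W, C] activations from the final conv layer.
    w_value, w_proj: learned [C, C] / [C, D] matrices (part of the model)."""
    pooled = features.mean(axis=(0, 1))    # average over the spatial dimensions
    return (pooled @ w_value) @ w_proj     # linear maps into the embedding space

feats = np.random.rand(7, 7, 2560)         # placeholder activation grid
w_v = np.random.rand(2560, 2560)
w_p = np.random.rand(2560, 640)
embedding = global_weighted_average_pool(feats, w_v, w_p)
print(embedding.shape)                      # (640,)
```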
Fig 9: the “celebration” image is just noise.
Responses to Reviewer 2
How easy would it be to replicate (or falsify) the results? Depends a great deal on how well the training method is explained in the other paper. The very basics of the training method were described, but it was hard to see the details of how the dataset was collected… What was the source of the (image, caption) pairs used to train CLIP? … Also, how is 'zero-shot' ImageNet classification working?
Thanks for raising this. Our description of CLIP definitely wasn’t at the level of detail one expects from a paper on the model, since we’re studying a model produced by others and saw that as being more in the remit of their paper. We included a short appendix describing it, but that’s definitely not as useful as having the paper. Thankfully, the CLIP paper is now out, which hopefully resolves this issue.
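On the zero-shot question specifically, the recipe is: embed a prompt for each class name, embed the image, and take the class whose text embedding has the highest cosine similarity with the image embedding. A short sketch, assuming the interface of the now-released CLIP repository (the prompt template, label set, and image path below are illustrative):

```python
# Hedged sketch of zero-shot classification with the released CLIP package
# (https://github.com/openai/CLIP). Prompts, labels, and image path are placeholders.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("RN50x4", device="cpu")            # the model studied in the paper
class_names = ["tabby cat", "golden retriever", "pickup truck"]  # placeholder label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)       # hypothetical input image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T             # scaled cosine similarities
    probs = logits.softmax(dim=-1)

print(class_names[probs.argmax().item()])                         # predicted class
```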
Typos and typo-likes:
Thanks for catching all of these typos! (Some of them snuck past because it’s harder to spell check text that’s embedded in diagrams.) We’ve gone through and corrected all the typos and similar issues you raised.
It appears that Radford et al [3] is meant to be anonymous at this stage of the review process.
We’re unsure what the correct etiquette should have been here. Thankfully, their paper is now public so this is no longer an issue.
Footnote about 'angel neuron' doesn't actually link to angel neuron.
The angel neuron links to the map visualization and highlights the angel neuron in that view. It can then be clicked to open it in Microscope. This is different from how we normally link neurons, but we think it is more informative.
Complaints about graphs etc:
Microscope takes over 30 seconds to load for me.
This has been improved in the latest version of Microscope.
Figures 4 and 15 were glitchy for me, using Firefox on Ubuntu, in a way that made it hard to understand all the relevant information. See this imgur gallery.
Thanks for catching this. We fixed the Firefox bugs you noted, and checked all diagrams again in Chrome, Safari, Edge, and Firefox.
The dataset pictures that the 'West Africa neuron' (1,257 in 4/5/Add_6) has its highest activations in response to are of humans and gorillas, but not other animals. Plausibly a reflection of crude stereotypes.
The model definitely has a lot of biases and stereotypes. Hopefully one of the values of this kind of research is helping to surface them. And this example certainly calls to mind past models with this exact bias.
In this particular case, there are a few caveats to keep in mind:
(1) The dataset examples you are looking at are from ImageNet, which means that you only see the ImageNet images that cause the neuron to fire most. This doesn’t represent, for example, country flags very well. This lack of diversity is hopefully addressed in part by the addition of a new dataset to Microscope, the Yahoo Flickr Creative Commons dataset.
(2) Many region neurons respond to local wildlife (for example, the Australia neuron responds to kangaroos, and the North America neuron to moose) so it’s hard to distinguish the extent to which this is a racial stereotype / slur vs geographical knowledge of wildlife.
Interestingly, the 4% of regional neurons the network devotes to Africa is pretty close to the 3% of world GDP Africa represents. Would be interesting to see whether this fit is true for other regions.
That’s a very interesting observation. One could imagine GDP being a proxy for something like how active a population is on certain parts of the internet.
The mental illness neuron seems to respond to 'anxiety', 'depression', and 'bipolar', but it would be nice to see responses to a wider swath of mental illnesses. One way to explore this could be the 3 clusters of personality disorders.
This would be a very interesting direction for future research. Perhaps future models will have a neuron for each cluster! (Or perhaps looking at directions in activation space other than neurons could reveal it in this model.)
Responses to Reviewer 3
Unfortunate that the Hitler neuron responds to German food
Is there any way to determine the statistical significance of the estimated conditional probabilities of neuron classes (e.g., for Figures 2, 5, and 7)?
There are several ways we could test for statistical significance. Each forces us to consider a specific question, but we don't believe any of them sufficiently get at the heart of what these figures are attempting to address. We believe the diagram communicates better than a summary statistic of the statistical significance.
We did, however, do a follow-up experiment on the "Black / LGBT Rights" class in Figure 2, because there aren’t many images in this category. The details of the experiment are in the footnote of Figure 2.
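For concreteness, one of the significance checks we could have run (this is not the follow-up experiment described in the footnote) is a simple bootstrap confidence interval on an estimated conditional probability, e.g. the fraction of strongly-activating images falling into a given class. The sample below is hypothetical.

```python
# Illustrative bootstrap confidence interval for an estimated conditional probability.
import numpy as np

def bootstrap_ci(is_in_class, n_boot=10000, alpha=0.05, seed=0):
    """is_in_class: boolean array, one entry per strongly-activating image."""
    rng = np.random.default_rng(seed)
    n = len(is_in_class)
    stats = [rng.choice(is_in_class, size=n, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return is_in_class.mean(), (lo, hi)

labels = np.array([1] * 7 + [0] * 18, dtype=bool)   # hypothetical small sample
print(bootstrap_ci(labels))                          # point estimate and 95% interval
```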
Grammatical error: “with more than 90% of the images with a standard deviation greater than 30 are related to Donald Trump”
Fixed.
Faceted feature visualization results are very compelling. I have not seen “faceted feature visualization” before. Reading the appendix, this appears to be a new approach. Would love to see a longer exposition on this, as the results in Figure 4 are strong.
Thanks! We think faceted feature visualization is a pretty useful method, and we’ve added a reference implementation. We’ve expanded on this in the paper in the appendix.
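At a high level, the idea is to combine the usual feature visualization objective for a neuron with a term that steers the image toward a facet, using a linear probe trained on lower-layer features. The sketch below is a simplified stand-in for that idea, not the reference implementation: the model, layer choices, probe, and weighting are all placeholders.

```python
# Very rough sketch of a faceted feature visualization objective: maximize a
# target neuron while steering the image toward a "facet" via a linear probe
# on lower-layer features. Everything below is a placeholder stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(                      # placeholder for an image model
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
facet_probe = torch.randn(16)               # stand-in for a trained facet probe on an early layer
target_neuron = 5

image = torch.randn(1, 3, 64, 64, requires_grad=True)   # parameterized image
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    early = model[:2](image)                             # lower-layer features
    final = model(image)
    neuron_obj = final[0, target_neuron]                 # drive the target neuron
    facet_obj = (early.mean(dim=(2, 3))[0] * facet_probe).sum()  # steer toward the facet
    loss = -(neuron_obj + 0.5 * facet_obj)
    opt.zero_grad(); loss.backward(); opt.step()
```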
Figure 6 is a tour de force of visualization. Nit: “city name activations” give large dots on map in Asia and Europe, but not elsewhere. Why?
Great question! We don’t have a simple explanation for this phenomenon, but suspect that American cities may be represented in a distributed way, or in other neurons. Note that this phenomenon does not manifest in all models; it does not happen in, e.g., RN101 or v1.
Figure 7 “This is the first neuron we've studied closely with a distinct regime change between medium and strong activations.” Interesting, I wonder why? “Flags” category has a large variance, going from -6 to +22 (perhaps it reflects the degree to which each flag is correlated with “Ghana;” some flags will have low correlation whereas others will have high).
There are two reasons we believe we haven't seen this distinct regime change before. First, our past studies have mostly been focused on extremes, studying feature visualizations and dataset examples that cause the neuron to fire most strongly in either a positive or negative direction. For this reason, it's possible that models we've studied closely do contain neurons with a sudden regime change in their activations and we never noticed.
Second, the neurons of CLIP are qualitatively more complex. It’s thus possible that the regime change in activations is just another way these neurons are more complex than the neurons in simpler models like InceptionV1.
Nit: feature vis for “string instruments” is a face
The feature visualization in that small figure is showing the face facet by default, but the facet can be changed by hovering over it. In the case of string instruments, none of the facets seem to be a good fit, but the normal feature visualization does show guitar strings.
“Scorpion” is labeled as “fish” and “seafood” in Figure 8. This is an understandable failure mode, but maybe worth pointing out.
We added a note pointing that out in the caption of the figure.
Grammar: “given an text embedding”
This has been fixed.
Why are the feature visualizations of concepts like “lightning” “art” “painting” and “miracle” faces in Figure 10? Could we have used faceted feature visualization to improve the visualization? The diagram still makes intuitive sense and does not need to change.
This is because we are using the face facet, which encourages face-like images.
Adversarial attacks section is clear and experiments make sense. I particularly enjoy the connection to the Stroop effect.
Thank you!
No conclusion section?
We decided that we didn’t want to write a conclusion pro forma if it didn’t add value to the paper. The things we considered writing there were already well covered in the introduction or other sections, and so we skipped it.