The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
General Comments
The article presents an interesting empirical exploration of the properties of a multimodal model trained to align images and corresponding captions in the embedding space. It demonstrates that high-level semantic concepts, presented in various modalities, can be associated with individual neurons in the final layers of the neural net. I enjoyed reading the paper. Many of the feature visualizations produced by the neural network are strikingly beautiful, and the plots and diagrams are well made and invite readers to explore further.
From my perspective, I'd like to see more steps beyond documenting the properties of deep neural nets (such as the alignment of features with human-interpretable concepts, and the generative capabilities of discriminative nets) toward figuring out why and how neural networks come to have these properties.
Questions, comments and nitpicks.
The authors were looking for single interpretable neurons. It is probably worth mentioning reference [11] (Network Dissection), which quantitatively showed that individual neurons are more interpretable than random dense projections (as hypothesized previously in [30]). Still, I wonder how interpretable sparse random neuron combinations can be. In the sections "The ImageNet Challenge" and "Emotion Composition" the authors design some of these combinations, but what would happen if they were sampled randomly (say, 2-3 neurons)?
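To make this concrete, here is the kind of experiment I have in mind, as a minimal sketch. It assumes a hypothetical `activations` array of penultimate-layer activations over a dataset; all names here are placeholders, not the authors' code:

```python
import numpy as np

def random_sparse_direction(n_neurons, k=3, rng=None):
    """Sample a direction in activation space supported on k random neurons."""
    rng = rng or np.random.default_rng()
    direction = np.zeros(n_neurons)
    idx = rng.choice(n_neurons, size=k, replace=False)
    direction[idx] = rng.standard_normal(k)
    return direction

def top_images_for_direction(activations, direction, top_k=9):
    """Rank dataset images by their projection onto the sparse direction;
    inspecting the top images hints at how interpretable the combination is."""
    scores = activations @ direction
    return np.argsort(scores)[::-1][:top_k]
```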
The labelling of some neurons across all modalities is much less convincing than for others. For example, the "kids' art" label is well supported by the feature visualization, dataset samples, and text strings, while "Egyptian art" shows up only in the feature channel visualization; the neuron visualization and many dataset samples instead show a loaf of bread. This raises the question of how accurate the visualization is. In particular, how does the activation value of the synthetic visualization compare to that of real images?
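One simple way to quantify this, sketched with a hypothetical `neuron_activation(image)` helper (not part of the paper's released code):

```python
import numpy as np

def synthetic_percentile(neuron_activation, synthetic_vis, dataset_images):
    """Where does the synthetic visualization's activation fall relative
    to the neuron's activation distribution over real images?"""
    real = np.array([neuron_activation(img) for img in dataset_images])
    return 100.0 * np.mean(real < neuron_activation(synthetic_vis))
```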
The CLIP model is not the first model to operate on a joint image-text modality. It may be worth mentioning the work on automatic image captioning with generative text models, whose vision components might display similar multimodal properties (briefly mentioned in footnote 4).
Why are there three twitter datasets (twitter{1,2,3}) in Microscope?
Some of the example concepts discovered by the network are truly fascinating, and the spurious associations are hilarious. I really enjoyed:
The bear that's also a teddy.
The past-tense concept, which is amazing.
The dataset samples suggesting that the Christmas/ass neuron is also triggered by the word "butt".
"Looking to neuroscience, they might sound like 'grandmother neurons'" -- I'd like to see a more detailed explanation of the "grandmother neuron" concept here.
Footnotes are misaligned: e.g., footnote "29" in the paper shows up as "30" in the footnote list at the end.
Footnote 29 (or 30): "Neuron activations tend to follow an exponential distribution in their tails". In my experience the tails are much heavier on the positive side than on the negative. Likely caused by ReLUs, this phenomenon may have some interesting philosophical implications: it's much easier to prove that something exists than that it doesn't.
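The asymmetry is easy to check empirically. A sketch, assuming `acts` is a hypothetical 1-D array of a neuron's pre-nonlinearity activations collected over many inputs:

```python
import numpy as np

def tail_scales(acts, q=0.99):
    """Crude exponential-tail scale for each side of the distribution.

    For an exponential tail, the mean exceedance above a high threshold
    estimates the scale parameter; a larger scale means a heavier tail.
    """
    hi, lo = np.quantile(acts, [q, 1.0 - q])
    pos_scale = (acts[acts > hi] - hi).mean()
    neg_scale = (lo - acts[acts < lo]).mean()
    return pos_scale, neg_scale  # I'd expect pos_scale >> neg_scale in ReLU-heavy nets
```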
Fig 6. Really awesome. But where are the US cities?
The placeholder text "(this is a terrible description, please change)" is still visible in Microscope.
The image at the top of Fig 8: why does it use the facial facet for "birds", "string instruments", and "dogs"?
Fig 8: I really love the sparse matrix of class-category relations!
What is the ImageNet performance of the model with dense linear probes?
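For reference, the usual dense linear-probe protocol looks roughly like the sketch below, with placeholder feature and label arrays; the regularization strength would of course need the usual sweep:

```python
from sklearn.linear_model import LogisticRegression

# train_feats, test_feats: (n, d) arrays of frozen CLIP image embeddings
# train_labels, test_labels: ImageNet class indices
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
print("dense linear-probe accuracy:", probe.score(test_feats, test_labels))
```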
"The two sides meet at the end, going through some processing and .... If we ignore spatial structure." The procedure seems somewhat similar to this and other works on weighted pooling. Footnote 34 may need a bit more explanation. For example, it should be made clear that the averaging happens over the spatial dimensions. Where do those weight matrices come from? Are they part of the CLIP model?
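To illustrate, here is a minimal sketch of what I understand weighted spatial pooling to mean; the weight vector is a stand-in, not necessarily CLIP's actual pooling parameters:

```python
import numpy as np

def weighted_spatial_pool(features, w_attn):
    """Pool an (H, W, C) feature map into a single C-vector.

    w_attn (shape (C,)) produces one scalar logit per spatial position;
    a softmax over positions gives the pooling weights, so the average
    is taken over the spatial dimensions only.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    logits = flat @ w_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ flat  # shape (C,)
```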
Fig 9: the "celebration" image is just noise.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score; we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results
Advancing the Dialogue
How significant are these contributions? 5/5

Outstanding Communication
Article Structure: 5/5
Writing Style: 4/5
Diagram & Interface Style: 5/5
Impact of diagrams / interfaces / tools for thought: 5/5
Readability: 4/5

Scientific Correctness & Integrity
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5