
Using DNNs to understand primate vision: A shortcut or a distraction?

Published online by Cambridge University Press: 06 December 2023

Yaoda Xu
Affiliation:
Department of Psychology, Yale University, New Haven, CT, USA yaoda.xu@yale.edu, https://sites.google.com/view/yaodaxu/home
Maryam Vaziri-Pashkam
Affiliation:
National Institute of Mental Health, Bethesda, MD, USA maryam.vaziri-pashkam@nih.gov, https://mvaziri.github.io/Homepage/Bio.html

Abstract

Bowers et al. bring forward critical issues in the current use of deep neural networks (DNNs) to model primate vision. Our own research further reveals that DNNs use fundamentally different algorithms for visual processing than the brain does. It is time to reemphasize the value of basic vision research and to put more resources and effort into understanding the primate brain itself.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Similarities exist between deep neural networks (DNNs) and the primate brain in how they process visual information. This has generated excitement that the algorithms governing high-level vision might "automagically" emerge in DNNs and provide us with a shortcut to understanding and modeling primate vision. In their detailed critiques, Bowers et al. bring forward significant drawbacks in the current applications of DNNs to explain primate vision. Perhaps it is time to step back and ask: Is using DNNs to understand primate vision really a shortcut, or is it a distraction?

Using detailed examples, Bowers et al. point out that performance alone does not constitute good evidence that the primate brain and DNNs use the same processing algorithms. They show that DNNs fail to account for a large number of findings in vision research. In our own research, by comparing DNN responses to our previously collected fMRI datasets (Vaziri-Pashkam, Taylor, & Xu, 2019; Vaziri-Pashkam & Xu, 2019), we found that DNNs' performance reflects the fact that they are built following the known architecture of the primate lower visual areas and are trained with real-world object images. Consequently, DNNs fully capture the visual representational structures of lower human visual areas during the processing of real-world images, but not those of higher human visual areas during the processing of these images, nor those of either level of processing for artificial images (Xu & Vaziri-Pashkam, 2021a). The close brain–DNN correspondence reported in earlier fMRI studies thus appears overly optimistic: those studies included only real-world objects and compared DNNs to brain data with relatively low power. When we expanded the comparisons to a broader set of real-world stimuli and to artificial stimuli, and compared DNNs to brain data with higher power, the correspondence became much weaker.
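
To make the logic of such brain–DNN comparisons concrete, the following is a minimal sketch of a representational similarity analysis of the general kind described above. It is an illustration only, not the actual analysis pipeline of the cited studies: the array shapes and the random stand-in data are hypothetical, and only the standard numpy/scipy calls are assumed.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def rdm(patterns):
    """Condensed representational dissimilarity matrix:
    correlation distance between every pair of condition patterns."""
    return pdist(patterns, metric="correlation")

# Stand-in data: response patterns of a brain region (voxels) and of a
# DNN layer (units) to the same set of stimulus conditions.
n_conditions = 40
brain = rng.standard_normal((n_conditions, 500))   # hypothetical fMRI patterns
layer = rng.standard_normal((n_conditions, 4096))  # hypothetical DNN activations

# Second-order comparison: do the two systems carry similar
# representational geometries over these conditions?
rho, _ = spearmanr(rdm(brain), rdm(layer))
print(f"Brain-DNN representational similarity (Spearman rho): {rho:.3f}")
```

Because the comparison operates on representational geometries rather than raw responses, it can be run separately for lower versus higher visual areas and for real-world versus artificial stimuli, which is how the divergences noted above become visible.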

Perhaps the most troubling finding from our research is that DNNs do not form the same transformation-tolerant visual object representations that the human brain does. Decades of neuroscience research have shown that one of the greatest achievements of primate high-level vision is its ability to extract object identity across changes in nonidentity features, forming transformation-tolerant object representations (DiCarlo & Cox, 2007; DiCarlo, Zoccolan, & Rust, 2012; Tacchetti, Isik, & Poggio, 2018). This allows us to rapidly recognize an object under different viewing conditions. Computationally, achieving tolerance reduces the complexity of learning by requiring fewer training examples and improves generalization to objects and categories not included in training (Tacchetti et al., 2018). We found that while the object representational geometry became increasingly tolerant to changes in nonidentity features from lower to higher human visual areas, this was not the case in DNNs pretrained for object classification, regardless of network architecture or depth, and with or without recurrent processing or pretraining that emphasizes shape processing (Xu & Vaziri-Pashkam, 2022). By comparing DNN responses with another existing fMRI dataset (Jeong & Xu, 2017), we further showed that while human higher visual areas exhibit clutter tolerance, such that fMRI responses to an object pair can be well approximated by the average of the responses to each constituent object shown alone, this was not the case in DNNs (Mocz, Jeong, Chun, & Xu, 2023). We additionally found that DNNs differ from the human visual areas in how they represent object identity and nonidentity features over the course of visual processing (Xu & Vaziri-Pashkam, 2021b). With their vast computing power, DNNs likely associate different instances of an object with a label without preserving the object representational geometry across nonidentity feature changes to form brain-like tolerance. While this is one way to achieve tolerance, it requires a large amount of training data and generalizes poorly to objects not included in the training, the two major drawbacks of current DNNs (Serre, 2019).
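
The two tolerance measures above can be sketched in the same spirit. This is a hedged illustration only: the simulated "tolerant" data, array shapes, and noise levels are hypothetical and chosen so that the code demonstrates the measures themselves, not the empirical results of the cited studies.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_objects, n_units = 20, 500

# Transformation tolerance: patterns for the same objects under two
# states of a nonidentity feature (e.g., two sizes). A tolerant system
# preserves the identity geometry across the change; here tolerance is
# simulated by reusing the same underlying patterns plus noise.
objects = rng.standard_normal((n_objects, n_units))
state_a = objects + 0.3 * rng.standard_normal((n_objects, n_units))
state_b = objects + 0.3 * rng.standard_normal((n_objects, n_units))
rho, _ = spearmanr(pdist(state_a, "correlation"),
                   pdist(state_b, "correlation"))
print(f"Identity geometry preserved across the transformation: rho = {rho:.3f}")

# Clutter tolerance via the averaging model: the response to an object
# pair should approximate the mean of the responses to each object
# shown alone.
obj1 = rng.standard_normal((n_objects, n_units))
obj2 = rng.standard_normal((n_objects, n_units))
pair = 0.5 * (obj1 + obj2) + 0.1 * rng.standard_normal((n_objects, n_units))
predicted = 0.5 * (obj1 + obj2)
fits = [np.corrcoef(pair[i], predicted[i])[0, 1] for i in range(n_objects)]
print(f"Mean averaging-model fit: r = {np.mean(fits):.3f}")
```

High values on both measures characterize human higher visual areas; the finding reported above is that DNNs fall short on both, under a wide range of architectures and training regimes.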

If DNNs use fundamentally different algorithms for visual processing, then in what way do they provide shortcuts, rather than distractions, in helping us understand primate vision? It may be argued that since DNNs are currently the best models for producing human-like behavior, we should keep refining them using our knowledge of the primate brain. This practice, however, relies on a thorough understanding of the primate brain; if we could already accomplish that, would we still need DNN modeling? As Kay (2018) stated, given that DNNs typically contain millions or even hundreds of millions of free parameters, even if we succeed in duplicating the primate brain in DNNs, how does replacing one black box (the primate brain) with another black box (a DNN) constitute a fundamental understanding of primate vision? Perhaps it is time to reemphasize the value of basic vision and neuroscience research and to put more effort and resources into understanding the precise algorithms used by the primate brain in visual processing.

While current DNNs may not provide an easy and quick shortcut to understanding primate vision, can they still be useful? Some have used DNNs to test our theories about the topographic (Blauch, Behrmann, & Plaut, 2022) and anatomical organization of the brain (Bakhtiari, Mineault, Lillicrap, Pack, & Richards, 2021) and to answer "why" brains work the way they do (Kanwisher, Khosla, & Dobs, 2023). Here again, when our theories about the brain are not borne out in DNNs, are our theories wrong, or are DNNs simply poor models in those regards? It remains to be seen whether such approaches can bring us fundamental understanding of the brain beyond what we already know. Although DNNs may not yet possess the explanatory power we desire, they can nevertheless serve as powerful simulation tools to aid vision research. For example, we have recently used DNNs to fine-tune our visual stimuli and to lay out the detailed analysis pipeline that we plan to use to study visual processing in the human brain (e.g., Tang, Chin, Chun, & Xu, 2022; Taylor & Xu, 2021). DNNs are likely here to stay. Understanding their drawbacks and finding the right way to harness their power will be key for future vision research.

Author's contribution

Y. X. wrote the manuscript with comments from M. V.-P.

Financial support

Y. X. was supported by the National Institutes of Health (NIH) Grant 1R01EY030854. M. V.-P. was supported by NIH Intramural Research Program ZIA MH002035.

Competing interest

None.

References

Bakhtiari, S., Mineault, P., Lillicrap, T., Pack, C., & Richards, B. (2021). The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning. Advances in Neural Information Processing Systems, 34, 25164–25178.
Blauch, N. M., Behrmann, M., & Plaut, D. C. (2022). A connectivity-constrained computational account of topographic organization in primate high-level visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 119, e2112566119.
DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333–341.
DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73, 415–434.
Jeong, S. K., & Xu, Y. (2017). Task-context dependent linear representation of multiple visual objects in human parietal cortex. Journal of Cognitive Neuroscience, 29, 1778–1789.
Kanwisher, N., Khosla, M., & Dobs, K. (2023). Using artificial neural networks to ask 'why' questions of minds and brains. Trends in Neurosciences, 46, 240–254.
Kay, K. N. (2018). Principles for models of neural information processing. NeuroImage, 180, 101–109.
Mocz, V., Jeong, S. K., Chun, M., & Xu, Y. (2023). The representation of multiple visual objects in human ventral visual areas and in convolutional neural networks. Scientific Reports, 13, 9088.
Serre, T. (2019). Deep learning: The good, the bad, and the ugly. Annual Review of Vision Science, 5, 399–426.
Tacchetti, A., Isik, L., & Poggio, T. A. (2018). Invariant recognition shapes neural representations of visual input. Annual Review of Vision Science, 4, 403–422.
Tang, K., Chin, M., Chun, M., & Xu, Y. (2022). The contribution of object identity and configuration to scene representation in convolutional neural networks. PLoS ONE, 17, e0270667.
Taylor, J., & Xu, Y. (2021). Joint representation of color and shape in convolutional neural networks: A stimulus-rich network perspective. PLoS ONE, 16, e0253442.
Vaziri-Pashkam, M., Taylor, J., & Xu, Y. (2019). Spatial frequency tolerant visual object representations in the human ventral and dorsal visual processing pathways. Journal of Cognitive Neuroscience, 31, 49–63.
Vaziri-Pashkam, M., & Xu, Y. (2019). An information-driven 2-pathway characterization of occipitotemporal and posterior parietal visual object representations. Cerebral Cortex, 29, 2034–2050.
Xu, Y., & Vaziri-Pashkam, M. (2021a). Limited correspondence in visual representation between the human brain and convolutional neural networks. Nature Communications, 12, 2065.
Xu, Y., & Vaziri-Pashkam, M. (2021b). The coding of object identity and nonidentity features in human occipito-temporal cortex and convolutional neural networks. Journal of Neuroscience, 41, 4234–4252.
Xu, Y., & Vaziri-Pashkam, M. (2022). Understanding transformation tolerant visual object representations in the human brain and convolutional neural networks. NeuroImage, 263, 119635.