
For deep networks, the whole equals the sum of the parts

Published online by Cambridge University Press: 06 December 2023

Philip J. Kellman
Affiliation:
Department of Psychology and David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA kellman@cognet.ucla.edu; https://kellmanlab.psych.ucla.edu/
Nicholas Baker
Affiliation:
Department of Psychology, Loyola University of Chicago, Chicago, IL, USA nbaker1@ucla.edu; https://www.luc.edu/psychology/people/staff/facultyandstaff/nicholasbaker/
Patrick Garrigan
Affiliation:
Department of Psychology, St. Joseph's University, Philadelphia, PA, USA patrick.garrigan@sju.edu; https://sjupsych.org/faculty_pg.php
Austin Phillips
Affiliation:
Department of Psychology, University of California, Los Angeles, Los Angeles, CA, USA asphillips@ucla.edu; https://kellmanlab.psych.ucla.edu/
Hongjing Lu
Affiliation:
Department of Psychology and Department of Statistics, University of California, Los Angeles, Los Angeles, CA, USA hongjing@ucla.edu; https://cvl.psych.ucla.edu/

Abstract

Deep convolutional networks exceed humans in sensitivity to local image properties but, unlike biological vision systems, do not discover and encode abstract relations that capture important properties of objects and events in the world. Coupling network architectures with additional machinery for encoding abstract relations will make deep networks better models of human abilities and more versatile and capable artificial devices.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Bowers et al. raise questions about the validity of methods and types of evidence used to compare deep networks to human vision. Their discussion also draws attention to serious limitations of convolutional neural networks for understanding vision. Here we focus on two ideas. First, compelling evidence is emerging that deep networks do not capture pervasive and powerful aspects of visual perceptual capabilities in humans. These limitations appear to be fundamental and relate to the lack of mechanisms for extracting and encoding abstract relations. Second, the problem of mimicry, both in comparing network and human responses in behavioral tasks and in comparing model unit activations to brain data, highlights a general difficulty in using potentially superficial similarities across systems to draw deep conclusions. We conclude by suggesting that understanding the mechanisms of visual perception will likely require synergies between network processing and processes that accomplish symbolic encoding of abstract relations.

Abstract relations in perception

Human perception derives abstract, symbolic representations from relational information in sensory input (e.g., Baker & Kellman, 2018), enabling visual representations to be widely useful in thinking and learning (Kellman & Massey, 2013). Processes like those found in deep convolutional neural networks (DCNNs) may be an important part of human vision, but their anchor in concrete, pixel-level properties makes them unlikely to be sufficient. DCNNs differ from human perceivers profoundly, for example, in their access to shape information (Baker, Lu, Erlikhman, & Kellman, 2018, 2020; Geirhos et al., 2019; Malhotra, Dujmović, & Bowers, 2022). Whereas shape is the pre-eminent driver of human object recognition, when shape and texture conflict, networks classify by texture. Humans readily see shape in glass figurines, but networks consistently misclassify these (e.g., labeling a robin as a shower cap, a fox as a chain, and a polar bear as a can opener). Silhouettes do better, producing around 40% accuracy (Baker et al., 2018; Kubilius, Bracci, & de Beeck, 2016), but rearranging their parts, which severely impairs human classification, has strikingly little effect on network responses. Conversely, for correctly classified silhouettes, adding small serrations along the boundary reduces network classifications to chance or below, while human perceivers are unaffected. These and other results indicate that networks extract local shape features but have little or no access to global shape (Baker et al., 2018, 2020).
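To make the logic of such probes concrete, here is a minimal sketch (not the authors' stimuli or code) of presenting a texture-free, silhouette-like image to an ImageNet-trained network. The synthetic black-on-white shape and the choice of VGG-19 via torchvision are illustrative assumptions.

```python
# Minimal sketch: probe an ImageNet-trained VGG-19 with a texture-free,
# silhouette-like image. The synthetic shape is a stand-in for the object
# silhouettes used in the studies cited above; requires torch/torchvision.
import torch
from torchvision import models, transforms
from PIL import Image, ImageDraw

# Stand-in "silhouette": a solid black blob on a white background.
img = Image.new("RGB", (224, 224), "white")
ImageDraw.Draw(img).ellipse([40, 60, 190, 180], fill="black")

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
with torch.no_grad():
    probs = model(preprocess(img).unsqueeze(0)).softmax(dim=1)

top5 = torch.topk(probs, k=5)
print(top5.indices.tolist(), top5.values.tolist())  # top ImageNet categories and probabilities
```

Comparing the network's top-ranked categories for many such texture-free shapes against human labels is the basic logic of the silhouette experiments described above.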

Recent research suggests that these findings regarding global shape reflect broader limitations in DCNNs' abilities to capture abstract relations from visual inputs. Baker, Garrigan, Phillips, and Kellman (2023) attempted to train DCNNs to capture several perceptual relations that human perceivers detect readily and generalize from even a small number of examples. These included the same/different relation (Puebla & Bowers, 2021a, 2021b), judging whether a probe was inside or outside of a closed contour, and comparing the number of sides of two polygons. Using restricted and unrestricted transfer learning with networks previously trained for object classification, we found that networks could come to exceed chance performance on training sets. Subsequent testing with novel displays, however, showed that the relations per se were not learned at all. Although human perceivers rapidly acquired and accurately applied these relations to new displays, networks showed chance performance. The limitation of deep networks in representing and generalizing abstract relations appears to be fundamental and general (see also Malhotra, Dujmović, Hummel, & Bowers, 2021).
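As a rough illustration of the restricted transfer-learning setup described above, the sketch below freezes an ImageNet-trained backbone, trains only a new two-way readout on a relational task (e.g., inside vs. outside), and evaluates on novel displays. The placeholder data, loader sizes, and hyperparameters are hypothetical, not those of the study.

```python
# Minimal sketch of "restricted" transfer learning: freeze an ImageNet-trained
# backbone, train only a new binary readout, then test on novel displays.
# The random placeholder tensors stand in for rendered relational displays.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def placeholder_loader(n_items):
    x = torch.randn(n_items, 3, 224, 224)        # stand-in displays
    y = torch.randint(0, 2, (n_items,))          # e.g., inside (1) vs. outside (0)
    return DataLoader(TensorDataset(x, y), batch_size=8)

train_loader, novel_loader = placeholder_loader(32), placeholder_loader(16)

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                      # keep pretrained features fixed
model.classifier[6] = nn.Linear(4096, 2)         # new trainable readout

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss_fn(model(images), labels).backward()
    optimizer.step()

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in novel_loader:          # novel displays test the relation itself
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
print(f"novel-display accuracy: {correct / total:.2f}")  # ~0.5 means the relation was not learned
```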

Methodological issues

The methodological issues raised by Bowers et al. are well-placed. Similarities between model responses and human judgments, and between model activations and brain activations, invite us to think that human processing may resemble deep networks. Yet claims based on both kinds of similarities may be tenuous. In our research, we have often seen deep networks produce somewhat better than chance responding on tests of relational processing, only to find that they were using some obscure, nonrelational property that correlated with the relevant relation.
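A toy example of this mimicry problem: if a nonrelational image statistic happens to correlate with the labels in a stimulus set, a classifier with no access to the relation at all can still score above chance. The sketch below is purely illustrative; the "foreground pixel count" cue and the simulated correlation are assumptions, not the confounds found in any particular study.

```python
# Toy illustration: above-chance performance on a "relational" task driven
# entirely by a nonrelational cue (here, a simulated foreground pixel count
# that is accidentally correlated with the labels).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                           # e.g., inside vs. outside
pixel_count = 500 + 30 * labels + rng.normal(0, 40, size=200)   # weak accidental correlation

clf = LogisticRegression().fit(pixel_count[:150, None], labels[:150])
accuracy = clf.score(pixel_count[150:, None], labels[150:])
print(f"accuracy from the nonrelational cue alone: {accuracy:.2f}")  # reliably above 0.5
```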

Parallel concerns apply to similarities between activation patterns in brains and DCNN layers. Although such similarities are intriguing, we typically do not know what activations in either the layers of a neural network or the visual brain are signaling. Common representations in two systems might produce similar activation patterns or respectable correlations in representational similarity analyses (RSAs), but it is a case of affirming the consequent to assume that high representational similarity scores imply common representations (Saxe, McClelland, & Ganguli, 2019). These issues may shed light on puzzling results. For example, Fan, Yamins, and Turk-Browne (2018) interpreted RSA results as suggesting that deep learning systems trained on photographs capture abstract representations of the kind humans use to see objects in line drawings. RSA was used to correlate similarity matrices obtained for photos and for line drawings; prior work had used RSA to argue for quantitative similarities between advanced layers of the model and primate IT. In contrast, we tested classification of outline drawings by deep networks (VGG-19 and AlexNet) trained on ImageNet for object classification and found no evidence of successful classification based on outlines (Baker et al., 2018). For 78% of objects, the networks would have done better by choosing an ImageNet category at random, and neither network produced a single correct first-choice classification. Do networks capture an abstract outline representation of objects, as suggested by RSA, yet fail to use it to classify inputs composed solely of outlines? As Bowers et al. suggest, the answer may lie in confounds among the stimulus properties that drive representational similarity (cf. Saxe et al., 2019).
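For readers unfamiliar with the method, the sketch below shows the basic RSA computation at issue: build a representational dissimilarity matrix (RDM) from a layer's activations for photos and for matched line drawings, then correlate the two RDMs. The random placeholder activations are assumptions; in a real analysis they would be recorded network or neural responses.

```python
# Minimal sketch of representational similarity analysis (RSA): correlate the
# representational dissimilarity matrices (RDMs) computed from one layer's
# responses to photos and to matched line drawings of the same items.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
photo_feats = rng.normal(size=(30, 512))      # placeholder: 30 items x 512 units
drawing_feats = rng.normal(size=(30, 512))    # placeholder activations for drawings

def rdm(features):
    """Pairwise correlation distance between item activation vectors (condensed form)."""
    return pdist(features, metric="correlation")

rho, _ = spearmanr(rdm(photo_feats), rdm(drawing_feats))
print(f"RSA (Spearman rho) between photo and drawing RDMs: {rho:.2f}")
# A high rho is consistent with, but does not establish, a shared abstract
# representation; confounded stimulus properties can produce the same pattern.
```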

Conclusion

A wealth of evidence suggests that biological vision systems extract and represent abstract relations. DCNNs far exceed humans in sensitivity to local image properties, but for humans, local sensory activations are transient, rapidly discarded, and used to discover and encode relations that capture important properties of objects and events in the world. Looking beyond the observed similarities, which Bowers et al. suggest may be superficial, the differences between networks and brains may be deep and fundamental. More work is needed to discern the sources of these differences. Combining network architectures with additional machinery for encoding abstract relations might make deep networks better models of human abilities and more versatile and capable artificial devices.

Financial support

This work was supported by funding from the National Institutes of Health (P. J. K., R01CA236791) and the National Science Foundation (H. L., BCS-2142269).

Competing interest

None.

References

Baker, N., Garrigan, P., Phillips, A., & Kellman, P. J. (2023). Configural relations in humans and deep convolutional neural networks. Frontiers in Artificial Intelligence, 5, 961595. doi:10.3389/frai.2022.961595
Baker, N., & Kellman, P. J. (2018). Abstract shape representation in human visual perception. Journal of Experimental Psychology: General, 147(9), 1295.
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12), e1006613.
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2020). Local features and global shape information in object classification by deep convolutional neural networks. Vision Research, 172, 46–61.
Fan, J. E., Yamins, D. L., & Turk-Browne, N. B. (2018). Common object representations for visual production and recognition. Cognitive Science, 42(8), 2670–2698.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1811.12231
Kellman, P. J., & Massey, C. M. (2013). Perceptual learning, cognition, and expertise. In B. H. Ross (Ed.), The psychology of learning and motivation (Vol. 58, pp. 117–165). Elsevier.
Kubilius, J., Bracci, S., & de Beeck, H. P. O. (2016). Deep neural networks as a computational model for human shape sensitivity. PLoS Computational Biology, 12(4), e1004896.
Malhotra, G., Dujmović, M., & Bowers, J. S. (2022). Feature blindness: A challenge for understanding and modelling visual object recognition. PLoS Computational Biology, 18(5), e1009572.
Malhotra, G., Dujmović, M., Hummel, J., & Bowers, J. S. (2021). The contrasting shape representations that support object recognition in humans and CNNs. bioRxiv preprint. https://doi.org/10.1101/2021.12.14.472546
Puebla, G., & Bowers, J. (2021a). Can deep convolutional neural networks support relational reasoning in the same-different task? bioRxiv preprint. https://doi.org/10.1101/2021.09.03.458919
Puebla, G., & Bowers, J. (2021b). Can deep convolutional neural networks learn same-different relations? bioRxiv.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23), 11537–11546.