
Implications of capacity-limited, generative models for human vision

Published online by Cambridge University Press: 06 December 2023

Joseph Scott German
Affiliation:
Department of Cognitive Science, University of California, San Diego, La Jolla, CA, USA jgerman@ucsd.edu
Robert A. Jacobs
Affiliation:
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA rjacobs@ur.rochester.edu https://www2.bcs.rochester.edu/sites/jacobslab/people.html

Abstract

Although discriminative deep neural networks are currently dominant in cognitive modeling, we suggest that capacity-limited, generative models are a promising avenue for future work. Generative models tend to learn both local and global features of stimuli and, when properly constrained, can learn componential representations and response biases found in people's behaviors.

Type: Open Peer Commentary
Copyright: © The Author(s), 2023. Published by Cambridge University Press

The target article offers cogent criticisms of deep neural networks (DNNs) as models of human cognition. Although discriminative DNNs are currently dominant in cognitive modeling, other approaches are needed if we are to achieve a satisfactory understanding of human cognition. We suggest that generative models are a promising avenue for future work, particularly capacity-limited, generative models designed around componential representations (e.g., part-based representations of visual objects and scenes).

A generative model learns a joint distribution over visible (i.e., observed) and hidden (latent) variables. Importantly, many generative models allow us to sample from the learned distribution, producing “synthetic” examples of the concept modeled by the distribution. Making inferences about external stimuli with a generative model is then a matter of identifying the hidden-variable settings most likely to have produced those stimuli. By their very nature, generative models neatly sidestep many of the issues with discriminative models described in the target article.
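
To make the joint-distribution idea concrete, here is a minimal sketch in Python of a toy generative model: a two-component Gaussian mixture with a hidden cause z and an observed stimulus x. Sampling from the joint produces synthetic stimuli, and inference inverts the model via Bayes' rule. All names and parameter values here are illustrative assumptions, not details of any model discussed in this commentary.

```python
# A minimal sketch of a generative model: a two-component Gaussian
# mixture over a one-dimensional stimulus x with a hidden cause z.
import numpy as np

rng = np.random.default_rng(0)

# Prior over the hidden variable z and likelihoods p(x | z).
prior = np.array([0.5, 0.5])     # p(z = 0), p(z = 1)
means = np.array([-2.0, 2.0])    # mean of x under each z
sigma = 1.0                      # shared standard deviation

def sample(n):
    """Sample from the joint p(z, x) = p(z) p(x | z)."""
    z = rng.choice(2, size=n, p=prior)
    x = rng.normal(means[z], sigma)
    return z, x

def infer(x):
    """Posterior p(z | x) by Bayes' rule: identify which hidden
    cause most likely produced the observed stimulus."""
    like = np.exp(-0.5 * ((x - means) / sigma) ** 2)  # unnormalized
    post = prior * like
    return post / post.sum()

_, xs = sample(5)    # "synthetic" examples of the modeled concept
print(infer(1.5))    # e.g., strongly favors the hidden cause z = 1
```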

Most obviously, and perhaps most importantly, generative models are typically judged not on predictive performance but on their ability to synthesize examples of concepts. Synthesis requires a more profound understanding of a concept than mere discrimination does, potentially yielding task-general representations capable of explaining far more of human perceptual and cognitive reasoning. For example, discriminative models trained to categorize images tend to base their decisions on texture patches and local shape, whereas humans rely on global shape. A successful generative model, by contrast, must capture global object shape, as otherwise its samples would not be realistic. Inference in such a model would therefore be sensitive to object shape as a matter of course, along with other properties that a discriminatively trained model might ignore.

Large DNNs also fail to capture another important feature of human cognition: capacity limits. People cannot remember all aspects of a visual environment, and so human vision must be selective and efficient. By contrast, DNNs often contain billions of adaptable parameters, giving them enormous learning, representational, and processing capacities that, as the target article notes, stand in stark contrast to the dramatically limited capacities of biological vision. The need for efficiency underlies people's attentional and memory biases. People are biased toward “filling in” missing features (i.e., features not attended or remembered) with values that are highly frequent in the environment. In addition, people are biased toward attending to and remembering the features most relevant to their current goal, thereby maximizing task performance.

Bates, Lerch, Sims, and Jacobs (2019) experimentally evaluated these biases using an optimal model of capacity-limited visual working memory (VWM) based on “rate-distortion theory” (RDT; see Sims, Jacobs, & Knill, 2012). Both biases were predicted by the RDT model: An optimal VWM should be biased toward allocating its limited memory resources toward high-probability feature values and toward task-relevant features. Bates and Jacobs (2021) studied people's responses in the domain of visual search and attention. The RDT model predicted important aspects of these responses, including “set-size” effects indicative of limited capacity, aspects not accounted for by a model based on Bayesian decision theory.
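
The core RDT prediction, that a capacity-limited memory should devote its resources to high-probability feature values, can be illustrated with the standard Blahut–Arimoto algorithm for rate-distortion optimization. The sketch below is a toy construction, not the Bates et al. model; the source distribution, distortion function, and trade-off parameter are all assumptions made for illustration.

```python
import numpy as np

# Source distribution over feature values: one value is far more
# frequent, mimicking a high-probability environmental feature.
p_x = np.array([0.7, 0.1, 0.1, 0.1])
vals = np.array([0.0, 1.0, 2.0, 3.0])

# Squared-error distortion between each source value and each
# possible reconstruction.
d = (vals[:, None] - vals[None, :]) ** 2

beta = 1.0                  # trade-off: larger beta buys lower distortion
q_xhat = np.full(4, 0.25)   # initial marginal over reconstructions

for _ in range(200):        # Blahut-Arimoto iterations
    # Optimal channel p(xhat | x) given the current marginal.
    q_cond = q_xhat * np.exp(-beta * d)
    q_cond /= q_cond.sum(axis=1, keepdims=True)
    # Marginal over reconstructions induced by that channel.
    q_xhat = p_x @ q_cond

# Rows are p(xhat | x); with limited capacity (small beta),
# reconstructions are biased toward the frequent value 0.0,
# a "filling-in" bias of exactly the kind described above.
print(np.round(q_cond, 3))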

In accord with these ideas, a popular form of generative model, the “variational autoencoder” (VAE), uses a loss function during training that penalizes large growth in capacity. A VAE maps an input through one or more hidden layers, with a capacity penalty at one of the layers, to an output layer that attempts to reconstruct the input. Reconstructions are typically imperfect because of the “lossy” representations at the capacity-restricted “bottleneck” hidden layer. Machine learning researchers have established important mathematical relationships between VAEs and RDT (Alemi et al., 2017, 2018; Ballé, Laparra, & Simoncelli, 2016; Burgess et al., 2018). Bates and Jacobs (2020) used VAEs to model biases and set-size effects in human visual perception and memory. We believe this is an encouraging early step toward developing capacity-limited, generative models of human vision.
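
As a concrete sketch of such a capacity penalty, the following code (our own minimal construction, assuming PyTorch; the layer sizes and beta value are illustrative) implements a tiny VAE whose loss adds a weighted KL term at the bottleneck, in the spirit of the beta-VAE objectives analyzed in the works cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """A minimal VAE with a capacity-penalized bottleneck."""
    def __init__(self, x_dim=784, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mu and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        # Reparameterization trick: sample the bottleneck code.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=4.0):
    # Reconstruction error: the "distortion" of the lossy code.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence from the prior: an upper bound on the "rate"
    # (information capacity) of the bottleneck.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl  # beta > 1 tightens the capacity limit

x = torch.rand(32, 784)       # a batch of toy "images"
model = TinyVAE()
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar).item())
```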

The desire for efficient representations also leads to componential, or part-based, approaches, and generative models naturally lend themselves to understanding concepts in terms of parts and the relationships between them, as humans do (in contrast to DNNs, as the target article points out, citing German & Jacobs, 2020, and Erdogan & Jacobs, 2017). The same basic parts can be used to create a wide variety of distinct objects simply by changing the relationships between them, the basis of many perceptual and cognitive models such as Biederman's (1987). Learning new object concepts thereby becomes more efficient: once a part has been learned, it can be reused in the representation and construction of any object concept that contains it, including novel ones. This idea can be extended further by supposing that parts are made of subparts, and so on, producing hierarchical, componential generative models (e.g., Lake, Salakhutdinov, & Tenenbaum, 2015; Nash & Williams, 2017).
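
To illustrate part reuse, here is a small Python sketch of our own devising, not any cited model: a toy library of reusable part templates is composed, via assumed relation offsets, into objects, so that two distinct object concepts share the same parts and differ only in the relations between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small library of reusable parts; each part is a 2-D polyline
# template. Once learned, the same parts compose into many objects.
PARTS = {
    "bar":  np.array([[0.0, 0.0], [2.0, 0.0]]),
    "hook": np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
}

def generate(parts_and_offsets, jitter=0.05):
    """Sample an object: place each part at its relative offset,
    then add per-vertex noise. The relations (offsets) carry the
    object's identity; the parts themselves are shared."""
    pieces = []
    for name, offset in parts_and_offsets:
        noise = rng.normal(0.0, jitter, PARTS[name].shape)
        pieces.append(PARTS[name] + np.array(offset) + noise)
    return np.vstack(pieces)

# Two different object concepts built from the same two parts,
# differing only in the relations between them.
obj_a = generate([("bar", (0, 0)), ("hook", (2, 0))])
obj_b = generate([("hook", (0, 0)), ("bar", (0, 2))])
print(obj_a.shape, obj_b.shape)
```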

To be sure, a capacity-limited, generative approach is not going to “solve” cognitive modeling overnight. It still faces major obstacles such as computationally expensive inference and a lack of objective criteria with which to judge the quality of its synthesized instances. However, we are optimistic that these issues can be resolved, and we hope the target article inspires researchers to look beyond the established discriminative DNN paradigm. Perhaps if capacity-limited, generative models receive as much research attention and development as discriminative models have, we can look forward to significant advances in both computational cognitive modeling and machine learning.

Financial support

This work was funded by NSF research grants BCS-1824737 and DRL-1561335.

Competing interest

None.

References

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., & Murphy, K. (2017). An information-theoretic analysis of deep latent variable models. arXiv preprint arXiv:1711.00464. Retrieved from https://arxiv.org/pdf/1711.00464v1.pdf
Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., & Murphy, K. (2018). Fixing a broken ELBO. arXiv preprint arXiv:1711.00464v3. Retrieved from https://arxiv.org/pdf/1711.00464v3.pdf
Ballé, J., Laparra, V., & Simoncelli, E. P. (2016). End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. Retrieved from https://arxiv.org/pdf/1611.01704.pdf
Bates, C. J., & Jacobs, R. A. (2020). Efficient data compression in perception and perceptual memory. Psychological Review, 127, 891–917.
Bates, C. J., & Jacobs, R. A. (2021). Optimal attentional allocation in the presence of capacity constraints in uncued and cued visual search. Journal of Vision, 21(5), 3, 1–23.
Bates, C. J., Lerch, R. A., Sims, C. R., & Jacobs, R. A. (2019). Adaptive allocation of human visual working memory capacity during statistical and categorical learning. Journal of Vision, 19(2), 11, 1–23.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2), 115–147.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., … Lerchner, A. (2018). Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599. Retrieved from https://arxiv.org/pdf/1804.03599.pdf
Erdogan, G., & Jacobs, R. A. (2017). Visual shape perception as Bayesian inference of 3D object-centered shape representations. Psychological Review, 124, 740–761.
German, J. S., & Jacobs, R. A. (2020). Can machine learning account for human visual object shape similarity judgments? Vision Research, 167, 87–99.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
Nash, C., & Williams, C. K. I. (2017). The shape variational autoencoder: A deep generative model of part-segmented 3D objects. Eurographics Symposium on Geometry Processing, 36(5), 1–11.
Sims, C. R., Jacobs, R. A., & Knill, D. C. (2012). An ideal observer analysis of visual working memory. Psychological Review, 119, 807–830.