
Perceptual learning in humans: An active, top-down-guided process

Published online by Cambridge University Press: 06 December 2023

Heleen A. Slagter*
Affiliation:
Department of Cognitive Psychology, Institute for Brain and Behavior Amsterdam, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands h.a.slagter@vu.nl https://research.vu.nl/en/persons/heleen-slagter

Abstract

Deep neural network (DNN) models of human-like vision are typically built by feeding blank slate DNNs visual images as training data. However, the literature on human perception and perceptual learning suggests that developing DNNs that truly model human vision requires a shift in approach, in which perception is treated not as a largely bottom-up process, but as an active, top-down-guided process.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Bowers et al. do the field a service with their thought-provoking target article. If the problems it lays out, which currently characterize deep neural network (DNN) models of human vision, are not adequately addressed, the field risks another winter. Bowers et al. sketch one important way forward: Building DNNs that can account for psychological data. I put forward that developing DNNs of human vision will first and foremost require a conceptual shift: From approaching perception as the outcome of a largely stimulus-driven process of feature detection and object recognition to treating perception as an active, top-down-guided process.

Building on the traditional notion of perception as a largely bottom-up process, mainstream computational cognitive neuroscience currently embraces the idea that simply feeding blank slate DNNs large amounts of training data will produce human-like vision. Yet, as Bowers et al.'s overview shows, this does not yet appear to be the case. Based on the literature on human perceptual learning and action-oriented theories of perception, I contend that this may be a direct result of the manner in which DNNs are typically trained to “perceive”: In a passive, data-driven manner. This approach typically does not induce perceptual learning that generalizes to new stimuli or tasks in humans (Lu & Dosher, 2022), is very different from how babies learn to perceive (Emberson, 2017; Zaadnoordijk, Besold, & Cusack, 2022), and does not take into account the action-oriented nature of perception (Friston, 2009; Gibson, 2014; Hurley, 2001).
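
To make concrete what "passive, data-driven" training typically looks like, the sketch below shows the standard supervised recipe in schematic form. It is a minimal illustration in Python/PyTorch with a toy network and placeholder data, not a reconstruction of any particular model discussed in the target article: the network never acts, never selects its own input, and its errors are imposed entirely from the outside by fixed labels.

```python
import torch
import torch.nn as nn

# Schematic "blank slate" setup: a small untrained network and a fixed set of
# labelled images it passively maps to categories (placeholder data throughout).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(512, 3, 32, 32)     # stand-in for training images
labels = torch.randint(0, 10, (512,))    # stand-in for category labels

for epoch in range(5):
    for i in range(0, len(images), 64):
        batch, target = images[i:i + 64], labels[i:i + 64]
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)   # error imposed from outside by labels
        loss.backward()
        optimizer.step()
```

Everything such a network comes to "know" is a mapping from given images to given labels; nothing in this loop lets it generate, select, or test its own sensory evidence.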

The most consistent finding in the literature on visual perceptual learning in human adults is that learning is highly specific to the trained stimuli and tasks (Lu & Dosher, 2022). For example, improvements are often not observed if the test stimulus has a different orientation or contrast than the trained stimulus, or when the trained stimulus is relocated or rotated at test (Fahle, 2004; Fiorentini & Berardi, 1980). These findings indicate that the typical outside-in approach used in perceptual learning studies, in which participants are presented with stimuli to detect or categorize, tends to induce learning at levels in the processing hierarchy that are too low to support feature-, stimulus-, or view-independent learning. Indeed, more recent research suggests that transfer of learning can be enhanced when learning can be top-down guided and connect to higher levels in the processing hierarchy (Tan, Wang, Sasaki, & Watanabe, 2019). For example, when the training procedure allowed for more abstract rule formation, complete transfer of learning between physically different stimuli was observed (Wang et al., 2016). These observations fit with recent findings that perceptual learning involves higher cognitive areas (Shibata, Sagi, & Watanabe, 2014; Zhang et al., 2010) and with proposals that perceptual learning is a top-down-guided process (Ahissar & Hochstein, 2004). Perceptual development in infants is also more top-down guided than traditionally assumed (Emberson, 2017), and perception continues to develop through childhood based on acquired knowledge across a range of tasks (Milne et al., 2022). Yet the building of models of human vision still typically starts from the notion that human-like vision will simply arise from feeding blank slate DNNs many supervised training images, which may likewise cause learning at too low levels in the processing hierarchy. Indeed, as Bowers et al. summarize, DNNs can be fooled by additive noise (Heaven, 2019), have difficulty generalizing learning to novel objects, and do not form transformation-tolerant object identity representations at higher layers (Xu & Vaziri-Pashkam, 2021). These problems conceivably reflect insufficient top-down-guided learning.

Research also shows that perception and action are interdependent processes, in particular during the development of perception (Zaadnoordijk et al., 2022). For example, kittens that are passively moved around do not develop depth perception (Held & Hein, 1963), just as DNNs that are fed visual input do not perceive depth (Jacob, Pramod, Katti, & Arun, 2021). Humans are not passive perceivers, but continuously build on past experiences to actively predict and generate their own sensory information through action, thereby top-down driving their own learning (Boonstra & Slagter, 2019; Buzsáki, 2019; Friston, 2009; Gibson, 1988). That perception incorporates expectations about the sensory outcome of actions is demonstrated by the fact that humans who wear goggles that flip the visual field from left to right do not perceive a normal world (albeit flipped left-right), but experience distorted perception (Kohler, 1963), caused by the disruption of normal sensorimotor contingencies. Moreover, recent studies show that responses in early visual cortex also reflect actions (Schneider, 2020). These findings cannot easily be explained by the classical view of the brain as processing information serially from sensory to cognitive to motor control stages (Hurley, 2001), each subserved by distinct brain regions – a view that currently still drives much of DNN research. Rather, they indicate that perception emerges from dynamic feedback relations between input and output, and does not merely entail the encoding of environmental statistics, but also the statistics of agent–environment interactions (Friston, 2010). Yet DNNs are generally trained in a passive way. This may cause shortcut learning, whereby DNNs latch onto features that do not matter to humans when categorizing objects. DNNs may focus on texture (Geirhos et al., 2022) or on local rather than global shape (Baker, Lu, Erlikhman, & Kellman, 2018), because they have never had to interact with objects, for which knowledge of global shape is important. Notably, the development of global shape representations may depend on the dorsal stream (Ayzenberg & Behrmann, 2022).
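
The contrast with the passive recipe above can be illustrated with a deliberately simple toy loop, sketched below. It is an assumption-laden illustration, not a model from the cited infant or robotics work: a toy agent generates its own actions, predicts their sensory consequences with a forward model, and learns only from the resulting prediction errors, so that what it acquires are statistics of agent–environment interaction rather than of a fixed image set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: the sensory consequence of an action is a
# fixed linear transform of that action (unknown to the agent).
true_transform = rng.normal(size=(4, 2))

def environment(action):
    return true_transform @ action        # sensation caused by the agent's own movement

# The agent's forward model W predicts the sensory outcome of its actions.
W = np.zeros((4, 2))
lr = 0.05

for step in range(2000):
    action = rng.normal(size=2)           # self-generated movement
    predicted = W @ action                # top-down prediction of its consequence
    observed = environment(action)        # actual sensory feedback
    error = observed - predicted          # prediction error drives learning
    W += lr * np.outer(error, action)     # learn agent-environment contingencies
```

Here the learning signal only exists because the agent moves; remove the action and, as with the passively transported kittens, there is nothing for the forward model to learn.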

To develop models of human-like vision, the field thus needs to turn the notion of perception on its head: From bottom-up driven to top-down guided and fundamentally serving agent–environment interactions. Important steps are already being taken in this direction. For example, DNN architectures wired to top-down infer their sensory input have been shown to work at scale (Millidge, Salvatori, Song, Bogacz, & Lukasiewicz, 2022). There are also exciting developments in robotics, in which artificial systems equipped with the ability to predict and generate their own sensory information through action can top-down drive their own learning (Lanillos et al., 2021). DNNs have the potential to provide powerful ways to study the human brain and behavior, but this will require the incorporation of biologically realistic, action-oriented learning algorithms, grounding vision in interactions with the environment.
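
To convey what "wired to top-down infer their sensory input" can mean in the simplest case, the sketch below is a single-layer caricature of predictive-coding-style inference; the fixed, randomly chosen generative weights are an assumption made purely for illustration. Perceiving an input is cast as iteratively adjusting a higher-level cause so that its top-down prediction explains the input, rather than as a single bottom-up feedforward pass.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-layer generative model: a higher-level cause z predicts the
# sensory input x through weights W (held fixed here for simplicity).
W = rng.normal(size=(16, 4))
x = W @ rng.normal(size=4) + 0.1 * rng.normal(size=16)   # noisy sensory input

# Perception as top-down inference: adjust the latent cause z until the
# top-down prediction W @ z accounts for the input.
z = np.zeros(4)
lr = 0.01
for _ in range(500):
    prediction = W @ z          # top-down prediction of the input
    error = x - prediction      # bottom-up signal carries only the prediction error
    z += lr * (W.T @ error)     # update the inferred cause to reduce the error
```

In full predictive-coding networks this inner inference loop is stacked over many layers and the weights themselves are learned from the residual prediction errors (Millidge et al., 2022); the sketch is only meant to convey the direction of travel: top-down predictions first, bottom-up signals as errors.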

Financial support

H. A. S. is supported by an ERC Consolidator Grant, “PlasticityOfMind” (101002584).

Competing interest

None.

References

Ahissar, M., & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10), 457–464. https://doi.org/10.1016/j.tics.2004.08.011
Ayzenberg, V., & Behrmann, M. (2022). Does the brain's ventral visual pathway compute object shape? Trends in Cognitive Sciences, 26(12), 1119–1132. https://doi.org/10.1016/j.tics.2022.09.019
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12), e1006613. https://doi.org/10.1371/journal.pcbi.1006613
Boonstra, E. A., & Slagter, H. A. (2019). The dialectics of free energy minimization. Frontiers in Systems Neuroscience, 13, 42. https://doi.org/10.3389/fnsys.2019.00042
Buzsáki, G. (2019). The brain from inside out. Oxford University Press.
Emberson, L. L. (2017). Chapter One – How does experience shape early development? Considering the role of top-down mechanisms. In Benson, J. B. (Ed.), Advances in child development and behavior (Vol. 52, pp. 1–41). JAI. https://doi.org/10.1016/bs.acdb.2016.10.001
Fahle, M. (2004). Perceptual learning: A case for early selection. Journal of Vision, 4(10), 4. https://doi.org/10.1167/4.10.4
Fiorentini, A., & Berardi, N. (1980). Perceptual learning specific for orientation and spatial frequency. Nature, 287(5777), 43–44. https://doi.org/10.1038/287043a0
Friston, K. (2009). The free-energy principle: A rough guide to the brain? Trends in Cognitive Sciences, 13(7), 293–301. https://doi.org/10.1016/j.tics.2009.04.005
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2022). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv, arXiv:1811.12231. https://doi.org/10.48550/arXiv.1811.12231
Gibson, E. J. (1988). Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Annual Review of Psychology, 39, 1–41.
Gibson, J. J. (2014). The ecological approach to visual perception (1st ed.). Routledge.
Heaven, D. (2019). Why deep-learning AIs are so easy to fool. Nature, 574(7777), 163–166. https://doi.org/10.1038/d41586-019-03013-5
Held, R., & Hein, A. (1963). Movement-produced stimulation in the development of visually guided behavior. Journal of Comparative and Physiological Psychology, 56, 872–876. https://doi.org/10.1037/h0040546
Hurley, S. (2001). Perception and action: Alternative views. Synthese, 129(1), 3–40. https://doi.org/10.1023/A:1012643006930
Jacob, G., Pramod, R. T., Katti, H., & Arun, S. P. (2021). Qualitative similarities and differences in visual object representations between brains and deep networks. Nature Communications, 12(1), Article 1. https://doi.org/10.1038/s41467-021-22078-3
Kohler, I. (1963). The formation and transformation of the perceptual world. Psychological Issues, 3, 1–173.
Lanillos, P., Meo, C., Pezzato, C., Meera, A. A., Baioumy, M., Ohata, W., … Tani, J. (2021). Active inference in robotics and artificial agents: Survey and challenges. arXiv, arXiv:2112.01871. https://doi.org/10.48550/arXiv.2112.01871
Lu, Z.-L., & Dosher, B. A. (2022). Current directions in visual perceptual learning. Nature Reviews Psychology, 1(11), Article 11. https://doi.org/10.1038/s44159-022-00107-2
Millidge, B., Salvatori, T., Song, Y., Bogacz, R., & Lukasiewicz, T. (2022). Predictive coding: Towards a future of deep learning beyond backpropagation? arXiv, arXiv:2202.09467. https://doi.org/10.48550/arXiv.2202.09467
Milne, G. A., Lisi, M., McLean, A., Zheng, R., Groen, I. I. A., & Dekker, T. M. (2022). Emergence of perceptual reorganisation from prior knowledge in human development and convolutional neural networks. bioRxiv. https://doi.org/10.1101/2022.11.21.517321
Schneider, D. M. (2020). Reflections of action in sensory cortex. Current Opinion in Neurobiology, 64, 53–59. https://doi.org/10.1016/j.conb.2020.02.004
Shibata, K., Sagi, D., & Watanabe, T. (2014). Two-stage model in perceptual learning: Toward a unified theory. Annals of the New York Academy of Sciences, 1316(1), 18–28. https://doi.org/10.1111/nyas.12419
Tan, Q., Wang, Z., Sasaki, Y., & Watanabe, T. (2019). Category-induced transfer of visual perceptual learning. Current Biology, 29(8), 1374–1378.e3. https://doi.org/10.1016/j.cub.2019.03.003
Wang, R., Wang, J., Zhang, J.-Y., Xie, X.-Y., Yang, Y.-X., Luo, S.-H., … Li, W. (2016). Perceptual learning at a conceptual level. The Journal of Neuroscience, 36(7), 2238–2246. https://doi.org/10.1523/JNEUROSCI.2732-15.2016
Xu, Y., & Vaziri-Pashkam, M. (2021). Limits to visual representational correspondence between convolutional neural networks and the human brain. Nature Communications, 12(1), Article 1. https://doi.org/10.1038/s41467-021-22244-7
Zaadnoordijk, L., Besold, T. R., & Cusack, R. (2022). Lessons from infant learning for unsupervised machine learning. Nature Machine Intelligence, 4(6), 510–520. https://doi.org/10.1038/s42256-022-00488-2
Zhang, J.-Y., Zhang, G.-L., Xiao, L.-Q., Klein, S. A., Levi, D. M., & Yu, C. (2010). Rule-based learning explains visual perceptual learning and its specificity and transfer. The Journal of Neuroscience, 30(37), 12323–12328. https://doi.org/10.1523/JNEUROSCI.0704-10.2010