After so many years observing the prosecution of p-values and everyday laboratory life, we are pleased to see a growing number of researchers turning their attention to critical matters such as theory development and experimentation (e.g., Proulx & Morey, 2021). But as we transition into these important new debates, it is crucial to avoid past intellectual excesses. In particular, we note a tendency to embrace passive technological solutions to problems of scientific inference and discovery that make little room for the kind of active theory building and critical thinking that in fact result in meaningful scientific advances (see Singmann et al., 2023). In this vein, we wish to express serious reservations regarding Almaatouq et al.'s critique.
The observation of puzzling, incongruent, and incommensurate results across studies is a common affair in the experimental sciences (see Chang, 2004; Galison, 1987; Hacking, 1983). Indeed, one of the central roles of experimentation is to “create, produce, refine and stabilize phenomena” (Hacking, 1983, p. 229), which is achieved through an iterative process that includes the ongoing improvement of experimental apparatus (see Chang, 2004; Trendler, 2009) and relevant variables (Jantzen, 2021). This process was discussed long ago by Maxwell (1890/1965), who described it as removing the influence of “disturbing agents” from a “field of investigation.”
Looking back at the history of modern memory research, we can identify this process in the development of experimental tasks (e.g., recognition, cued recall) with clear procedures (study/test phases) and stimuli (e.g., high-frequency words). This process is also manifest in the resolution of empirical puzzles, such as the innumerable exceptions, incongruities, and boundary conditions encountered by researchers in the search for the “laws of memory” (for a review, see Roediger, 2008). Far from insurmountable, these empirical puzzles have been continuously resolved through the interplay of tailored experiments and theories (e.g., Cox & Shiffrin, 2017; Hotaling, Donkin, Jarvstad, & Newell, 2022; Humphreys, Bain, & Pike, 1989; Roediger & Blaxton, 1987; Seamon et al., 1995; Turner, 2019; Vergauwe & Cowan, 2015). More specifically, candidate theories are constructed to explain existing results by postulating constructs (e.g., “trace strength”) and specifying how those constructs are related to observables (e.g., “more study time leads to more trace strength which leads to faster response times”). These theories also specify what should not be relevant, thereby identifying potential confounding variables that future experiments should control. For an exemplary case, consider the domain of short-term memory, where we can find a large body of empirical phenomena (e.g., Oberauer et al., 2018) alongside explanatory accounts that can accommodate them (e.g., interference-based theories; see Lewandowsky, Oberauer, & Brown, 2009).
Against this backdrop, it is difficult to find Almaatouq et al.'s critique convincing. On the one hand, they fail to explain the success of existing experimental practices (e.g., piecemeal testing) in domains such as human memory. On the other, their treatment of case studies such as “group synergy,” which has amassed a wealth of conflicting findings, does not include any indication that the process described above has failed. This omission leaves open a number of possible explanations. For example, incongruent results may reflect experimental artifacts or hidden ceteris paribus clauses and other preconditions (Meehl, 1990, p. 109) – can we really say that these possibilities have been thoroughly pursued? Alternatively, incongruent results could be a sign that those results should not be treated as part of the same “space” in the first place, that is, that they do not define a cohesive body of results that can be explained by a common theory.
Moving on to the actual proposal of integrated experiment design (IED), we find its potential contribution to be largely negative. Referring back to Maxwell's (1890/1965) description, what IED proposes is to allow “disturbing agents” back into the “field of investigation” as long as they are appropriately tagged and recorded. It is difficult to imagine how Newton's laws of motion could ever emerge from large-scale experiments evaluating different shapes of objects, velocities, viscosities, surface textures, and so on. Our main concerns with IED are summarized below:
(1) By placing a premium on commensurability, IED decreases the chances of new and unexpected findings (Shiffrin, Börner, & Stigler, 2018).
(2) By shifting researchers’ resources toward the joint observation of a large number of factors, IED disrupts the piecemeal efforts in experimentation and theorization that illuminate the processes underlying human data generation. For instance, it makes it difficult to tell an important result from one caused by a confound (for discussions, see Garcia-Marques & Ferreira, 2011; Kellen, 2019; Shiffrin & Nobel, 1997).
(3) IED turns existential-abductive reasoning on its head: Instead of developing explanatory constructs (e.g., through model development) in response to existing covariational information, a construct would be assumed a priori in the form of an empty vessel, to be later filled by the results of an experiment manipulating factors presumably related to it. For instance, the construct “attention” would be identified with the experimental manipulations thought to be relevant to “attention.” This concern is exemplified by the treatment of the so-called Moral Machine, a statistical model summarizing the observed relationships between moral judgments and a host of variables, as a bona fide theory of moral reasoning.
(4) By introducing a large number of factors, IED can easily degrade researchers’ ability to identify which theoretical components are doing the legwork and which ones are failing, especially when compared to piecemeal testing (e.g., Birnbaum, 2008; Dunn & Rao, 2019; Kellen, Steiner, Davis-Stober, & Pappas, 2020). The recent application of IED to risky-choice modeling (Peterson, Bourgin, Agrawal, Reichman, & Griffiths, 2021) illustrates this concern, as it is unclear which specific circumstances are leading one choice model to outperform another (e.g., is context dependency driven by feedback?).
It is our judgment that there is no one best way to do science, and that attempts to tell scientists how to do their job, including IED, will slow and hinder progress. IED solves a problem that does not exist and introduces a problem that science should do without.
Financial support
David Kellen was supported by NSF CAREER Award ID 2145308.
Competing interest
None.