To be Direct or not: Reversing Likert Response Format Items

Jaime García-Fernández; Álvaro Postigo; Marcelino Cuesta; Covadonga González-Nuevo; Álvaro Menéndez-Aller; Eduardo García-Cueto

doi:10.1017/SJP.2022.20

Published online by Cambridge University Press: 10 October 2022


Jaime García-Fernández*: Universidad de Oviedo (Spain)
Álvaro Postigo: Universidad de Oviedo (Spain)
Marcelino Cuesta: Universidad de Oviedo (Spain)
Covadonga González-Nuevo: Universidad de Oviedo (Spain)
Álvaro Menéndez-Aller: Universidad de Oviedo (Spain)
Eduardo García-Cueto: Universidad de Oviedo (Spain)
*: Correspondence concerning this article should be addressed to Jaime García-Fernández. Universidad de Oviedo. Facultad de Psicología. Plaza de Feijoo, S/N. 33003 Oviedo (Spain). E-mail: garciafernandezj@uniovi.es. Phone: +34–985104140.


Abstract

Likert items are often used in social and health sciences. However, the format is strongly affected by acquiescence, and reversed items have traditionally been used to control this response bias, a controversial practice. This paper aims to examine how reversed items affect the psychometric properties of a scale. Different versions of the Grit-S scale were applied to an adult sample (N = 1,419). The versions of the scale had either all items in positive or negative forms, or a mix of positive and negative items. The psychometric properties of the different versions (item analysis, dimensionality and reliability) were analyzed. Both negative and positive versions demonstrated better functioning than mixed versions. However, the mean total scores did not vary, which is an example of how similar means could mask other significant differences. Therefore, we advise against using mixed scales, and consider the use of positive or negative versions preferable.

Keywords

Acquiescence; Grit-S; Likert scales; reversed items

Information

Type: Research Article
Information: The Spanish Journal of Psychology, Volume 25, 2022, e24

DOI: https://doi.org/10.1017/SJP.2022.20
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright: © The Author(s), 2022. Published by Cambridge University Press on behalf of Universidad Complutense de Madrid and Colegio Oficial de Psicólogos de Madrid

Likert-type items (Likert, 1932) are one of the most widely used multiple-response question formats for assessing non-cognitive variables. In this type of item, the participant selects an option from a group of alternatives ordered by the level of agreement with the item statement. Positive forms of items (also called direct or non-reversed items) give high scores when the participant has a high level in the assessed trait. Negative forms of items (reversed) give low scores when the participant has a high level in the trait. Items can be reversed either by adding a negation to the item statement, a technique known as reverse orientation (e.g., from “I consider myself a good person” to “I do not consider myself a good person”), or by using antonyms, a technique known as reverse wording (e.g., from “I consider myself a good person” to “I consider myself a bad person”; Suárez-Álvarez et al., 2018; van Sonderen et al., 2013). In applied research, one question to ask when constructing a scale is whether the scale should include reversed items.

Reversed items aim to control one of the main response biases in self-report measures: acquiescence (Navarro-González et al., 2016). Acquiescence is defined as the tendency to agree with an item statement, disregarding its content (Paulhus & Vazire, 2005). It is not a response set bias (like social desirability) but a response style bias (like inattention; van Sonderen et al., 2013). Despite the widespread use of reversed items, psychometric research generally advises against them (Podsakoff et al., 2012; Vigil-Colet et al., 2020), although some authors do defend the practice, arguing that a small number of negative items may prompt slower, more careful reading of items (Józsa & Morgan, 2017).

Reversed items complicate cognitive processing of item statements (Marsh, 1986; Suárez-Álvarez et al., 2018; van Sonderen et al., 2013), hence they are not considered advisable (Irwing, 2018; Lane et al., 2016; Moreno et al., 2015; Muñiz & Fonseca-Pedrero, 2019). Furthermore, reversed items have a differential effect on participants depending on their cultures (Wong et al., 2003), personality traits (DiStefano & Motl, 2009), intelligence, and linguistic performance (Suárez-Álvarez et al., 2018). In addition, reversed items complicate inter-item correlation estimations (Navarro-González et al., 2016), diminish items’ discriminatory power (Chiavaroli, 2017; Józsa & Morgan, 2017), reduce scale reliability (Carlson et al., 2011), and produce different scores in positive and negative items. With regard to the latter, inverted items usually lead to higher scores once their scores are redirected (Suárez-Álvarez et al., 2018; Vigil-Colet et al., 2020), as people tend to disagree more with negative items than they agree with direct ones (i.e., people may doubt whether they “finish every task they start”, but will probably disagree with the idea of “not finishing every task they start”). However, Solís Salazar (2015) found higher scores for positive items, even when negative items were redirected.

Another problem caused by the use of reversed items is worse dimensionality indexes in essentially unidimensional constructs. In fact, a psychological construct could even move from being unidimensional to having two method factors when positive and negative items are mixed, with one factor for positive items and another for negative items (Essau et al., 2012; Horan et al., 2003; van Sonderen et al., 2013; Woods, 2006). The grit construct is an example of this issue. Grit is a trait based on perseverance combined with passion for accomplishing long-term goals (Duckworth, 2016; Duckworth & Quinn, 2009). The best-known scale for grit assessment is Grit-S (Duckworth & Quinn, 2009), which is supposed to assess two dimensions (perseverance of effort and consistency of interest). Here, negative items make up the first factor, while the second is made up of positive ones. Recent research has shown that grit has a unidimensional structure, with the bidimensional model being caused by reversed items (Areepattamannil & Khine, 2018; Gonzalez et al., 2020; Morell et al., 2021; Postigo et al., 2021; Vazsonyi et al., 2019). Therefore, some grit scales have been developed following the unidimensional hypothesis, such as the Oviedo Grit Scale (EGO; Postigo et al., 2021).

Research on item redirection usually uses unidimensional scales to show the effects of inverse items (Solís Salazar, 2015; Suárez-Álvarez et al., 2018; Vigil-Colet et al., 2020). However, reversed items in the Grit-S scale produced a method factor that had serious consequences in terms of the substantive conceptualization of the construct. Given this, we believe that demonstrating what effects reversed items have on the Grit-S scale may be interesting for grit researchers. Applied researchers may also benefit from a clear example of how item reversal may affect scales in terms of item properties, total scores, factor structures, and reliability. It is important to analyze all of these differences, because although some properties may not vary between groups, this does not mean that the remaining properties will behave in the same way.

Another interesting point is the effect that reversed items might have when the scale is related to other variables. Although there is much research about how item reversal affects internal consistency, reliability, and even total scores (as previously explained), we have not found any studies mentioning the effects negative items can have on correlations with other psychological constructs. Previous research on grit has reported that high levels of grit are related to low levels of neurotic disorders, such as anxiety or depression (Datu et al., 2019; Musumari et al., 2018). It would be interesting to see how this relationship (grit-neuroticism) may be affected by item reversal.

The present study examines whether item reversal in Likert response format items influences the psychometric properties of a grit scale (Grit-S) and the relationship with another variable (Neuroticism).

First, we aim to determine how item reversal affects the factorial structure of the scale. We would expect scales that mix both types of items to have a bidimensional structure (caused by a methodological artifact), and the positive and negative versions to have a unidimensional structure. The second objective is to analyze possible changes in the total score due to using reversed items. If negative items tend to have higher scores, then the more negative items a scale contains, the higher its total scores should be. Thus, we would expect the negative version to have higher total scores than the mixed or original versions, which would in turn have higher scores than the positive version. Third, we aim to show how reliability is affected by item reversal. As negative items usually correlate with each other more strongly than positive ones do (Solís Salazar, 2015), and because the Cronbach’s alpha (α) coefficient is based on these correlations, negative scales should have higher reliability coefficients than positive scales. In addition, mixing the two types of items can force a scale from being unidimensional to being bidimensional. This would worsen the reliability coefficients, which are conceived to be estimated on unidimensional scales. Finally, the fourth objective is to analyze how correlations with another variable are affected by the use of reversed items. As explained above, grit has an inverse relation with Neuroticism, so negative correlations with the Neuroticism subscale of the NEO Five Factor Inventory (NEO-FFI) are expected, and this relationship should be stronger for the more reliable scales.

Method

Participants

The study sample initially comprised 991 Spaniards who completed an online questionnaire. Of these, 103 participants were excluded because they demonstrated suspicious response behavior (i.e., taking too much or too little time to answer the questionnaire or leaving some items unanswered). This sample was complemented by another 531 participants from the same population who took part in a previous study where Grit-S scales were applied.

The final sample consisted of 1,419 participants divided into five groups (Table 1). As the table shows, the different groups had similar mean ages, sex ratios, and levels of educational qualifications. Most of the sample had completed university (66.8%), followed by those who finished high school (19.0%), vocational training (10.2%), and secondary/primary school (4.0%).

Table 1. Sample Groups Regarding the Answered Scale

Note. M = mean; SD = standard deviation; % studies = university studies/high school/vocational training/secondary or primary studies.

The sample size is adequate for Exploratory Factor Analysis as each group contains over 200 participants and the scales have no more than 10 five-point Likert items (Ferrando & Anguiano-Carrasco, 2010).

Instruments

Grit-S. Grit-S (Duckworth & Quinn, 2009) is a scale with eight items assessing two dimensions (four items for each dimension): perseverance of effort and consistency of interest. The items use a five-point Likert response format. We used the Spanish version by Arco-Tirado et al. (2018), in which Cronbach’s alpha = .77 for the consistency of interest dimension, Cronbach’s alpha = .48 for the perseverance of effort dimension and Cronbach’s alpha = .75 for the total score. This version of the scale has five inverted items (the original English scale has four), four of which are in the consistency of interest dimension. Another three versions of the scale were developed (positive, negative, and mixed; explained below). The reversal process was as follows: a group of seven experts in Psychometrics and Psychological Assessment created several alternative versions for each original item (positive or negative depending on the original item) using the reverse wording technique. The main reason for using reverse wording instead of reverse orientation is that the latter is not recommended by previous research (Haladyma & Rodríguez, 2013; Irwing, 2018; Muñiz & Fonseca-Pedrero, 2019). Afterwards, the representativeness of each alternative version was discussed. The versions with a minimum consensus of six out of seven (86%) experts were selected for developing the different scale versions. Hence, we created the Grit-S positive (all items in direct form), Grit-S negative (all items reversed) and Grit-S mixed (half of the items were randomly selected and inverted, disregarding their dimension). Although the original Grit-S scale is already a mixed scale, the reversed items in the Grit-S mixed version were randomly selected, and the consistency of interest dimension contains more than solely reversed items.

The four Grit-S scale versions are shown in Table A1 (see Appendix). The structure of each scale is given in Table A2 (see Appendix).

Neuroticism subscale, NEO-FFI test. The NEO-FFI test (Costa & McCrae, 1985) is an inventory for assessing personality following the Big Five personality model. The Neuroticism subscale is composed of 12 Likert-type items with five response categories from completely disagree to completely agree. It was adapted to Spanish by Cordero et al. (2008). The original Cronbach’s alpha coefficient for the scale was .90. In this study, we found a Cronbach’s alpha coefficient of .86.

Procedure

Each group completed one scale on an online survey platform. The participants were recruited through non-probabilistic convenience sampling. Data collection lasted 5 months. Participants completed the scale anonymously and voluntarily without any compensation. All participants gave their informed consent, and their anonymity was ensured according to Spanish data protection legislation, Organic Law 3/2018, of December 5, on Personal Data Protection and the Guarantee of Digital Rights (Ley Orgánica 3/2018, de 5 de diciembre, de Protección de Datos Personales y garantía de los derechos digitales).

Data Analysis

Dimensionality

Several Exploratory Factor Analyses (EFA) were conducted in order to assess the dimensionality of the scales. When items have five or more response alternatives, and skewness and kurtosis are less than one in absolute value, a Pearson correlation matrix is advised for factorial analysis (Lloret-Segura et al., 2014). The suitability of the matrix for factorial analysis was assessed using the Kaiser-Meyer-Olkin (KMO) and the Bartlett statistic. KMO should be greater than .80 to ensure a feasible analysis (Kaiser & Rice, 1974). Robust Unweighted Least Squares (RULS) was used as an estimation method. To decide on the number of extracted factors, we used an optimal implementation of Parallel Analysis (PA; Timmerman & Lorenzo-Seva, 2011). The feasibility of the factorial structure was assessed using total explained variance and the Comparative Fit Index (CFI). More precisely, to assess the suitability of the unidimensional structure, we estimated Explained Common Variance (ECV; Ferrando & Lorenzo-Seva, 2017). CFI should be greater than .95 (Hu & Bentler, 1999), and ECV greater than .80 (Calderón Garrido et al., 2019). The factor loadings of the different versions were compared using the Wrigley and Neuhaus congruence coefficient (García-Cueto, 1994).
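As an illustration of the last step, the following minimal sketch computes a congruence coefficient between two sets of factor loadings. It assumes the commonly used form of the coefficient (the sum of cross-products of the loadings divided by the square root of the product of their sums of squares); the function name and the loadings are hypothetical and are not the study's data.

```python
import math


def congruence_coefficient(loadings_a, loadings_b):
    """Congruence between two loading vectors (same items, same order).

    Assumed form: sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
    """
    if len(loadings_a) != len(loadings_b):
        raise ValueError("Both loading vectors must cover the same items.")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross / (norm_a * norm_b)


# Hypothetical loadings for two versions of an eight-item scale.
positive_version = [0.55, 0.60, 0.48, 0.62, 0.40, 0.58, 0.51, 0.66]
negative_version = [0.60, 0.63, 0.52, 0.65, 0.47, 0.61, 0.57, 0.70]

r_c = congruence_coefficient(positive_version, negative_version)
print(f"r_c = {r_c:.3f}")
```

Values close to 1 indicate that the two versions order and weight the items in nearly the same way; proportional loading vectors give exactly 1.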

Descriptive Statistics, Item Analysis and Differences in Scores

We calculated descriptive statistics (mean, standard deviation, skewness, kurtosis and discrimination index) for each item in each grit scale. The discrimination index should be higher than .20 to consider an item a good measure of the trait (Muñiz & Fonseca-Pedrero, 2019). To verify whether reversed items had significantly affected the total Grit-S scores, an ANOVA between scale versions (original, positive, negative, mixed) was performed.
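The sketch below shows how reversed items on a five-point scale can be redirected before item analysis, and how a discrimination index can then be obtained. It assumes that the discrimination index is the corrected item-total correlation (the item against the sum of the remaining items), which the text does not state explicitly; the function names and the responses are hypothetical.

```python
import math


def reverse_score(response, n_categories=5):
    """Redirect a reversed item: 1 <-> 5, 2 <-> 4, 3 stays 3 on a five-point scale."""
    return n_categories + 1 - response


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discrimination_index(data, item):
    """Corrected item-total correlation (assumed definition of the index)."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    return pearson(item_scores, rest_scores)


# Hypothetical raw responses: rows are respondents, columns are three items.
# The third item is worded in reverse.
raw = [
    [5, 4, 1],
    [4, 4, 2],
    [2, 3, 4],
    [1, 2, 5],
    [3, 3, 3],
]
redirected = [[r[0], r[1], reverse_score(r[2])] for r in raw]
indexes = [discrimination_index(redirected, j) for j in range(3)]
print([round(d, 2) for d in indexes])
```

Under the criterion cited above, an item would be flagged when its index falls at or below .20.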

Reliability

Scale reliability was assessed using Cronbach’s alpha. We computed Feldt’s w statistic (Feldt, 1969) to assess whether there were significant differences between the reliability of the scales.
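As a rough sketch of these two quantities, the code below computes Cronbach's alpha from raw item responses and the ratio on which Feldt's test is based. It assumes the usual statement of the test for independent samples, w = (1 − α₂)/(1 − α₁), which is referred to an F distribution with N₁ − 1 and N₂ − 1 degrees of freedom; the p-value is not computed here because the group sizes appear only in Table 1. The function names and the response matrix are hypothetical, while the two alpha values are the ones reported in the Discussion for the negative and positive versions.

```python
from statistics import variance


def cronbach_alpha(data):
    """Cronbach's alpha for a respondents x items matrix of scores."""
    k = len(data[0])
    item_variances = [variance([row[j] for row in data]) for j in range(k)]
    total_variance = variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def feldt_w(alpha_1, alpha_2):
    """Ratio compared against F(N1 - 1, N2 - 1) for independent samples (assumed form)."""
    return (1 - alpha_2) / (1 - alpha_1)


# Hypothetical responses (rows: respondents, columns: items), already redirected.
responses = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)

# Alphas reported in the Discussion: negative version .83, positive version .77.
w = feldt_w(0.83, 0.77)
print(f"alpha = {alpha:.3f}, w = {w:.3f}")
```

A ratio above 1 means the first scale leaves less unreliable variance than the second; whether that difference is significant depends on the two sample sizes.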

Descriptive statistics, ANOVA and the t-test were estimated using IBM SPSS Statistics (Version 24). Reliability and EFAs were assessed using FACTOR 12.01.01 (Lorenzo-Seva & Ferrando, 2013).

Correlations with Other Variables

Three versions of the Grit-S scale (positive, negative, mixed) were correlated with the Emotional Stability score of the NEO-FFI test. We could not estimate the correlation for the original version of the Grit-S as this sample did not complete the Emotional Stability scale.

Results

Dimensionality

A total of four EFAs were conducted, one for each version of the scale. Optimal Implementation of Parallel Analysis recommended one dimension in all versions of the scale (see Figure 1). Table 2 shows the KMO, Bartlett significance level, percentage of total explained variance, ECV and CFI for each version of the scale. Table 3 shows the comparisons between factorial loadings of the four Grit-S scales.

Figure 1. Results of the Optimal Implementation of Parallel Analysis

Table 2. Fit Indices of Exploratory Factor Analysis for Grit Scale Versions

Note. KMO = Kaiser-Meyer-Olkin statistic. ECV = Explained Common Variance. CFI = Comparative Fit Index. α = Cronbach’s α.

Table 3. Factorial Loadings Comparison of Grit Scale Versions

Note. (−) = negative; (+) = positive; (M) = mixed; (O) = original; r_c = Congruence coefficient.

The Grit-S negative version gave the best fit, followed by the positive, original, and mixed versions. The original and mixed versions did not reach the requirement established for KMO and ECV, thus indicating a bad fit to a unidimensional structure.

Descriptive Statistics, Item Analysis and Differences in Scores

Descriptive statistics for the items are shown in Table 4. The items from the versions of the Grit-S scale had means between 2.58 and 4.26 and standard deviations between 0.79 and 1.27. Apart from the kurtosis value for Item 2 (–1.00) and the skewness value for Item 5 (–1.28), both from the negative Grit-S scale, all skewness and kurtosis indexes were between ±1.

Table 4. Descriptive Statistics of the Items

Note. M = mean; SD = standard deviation; sk = skewness; k= kurtosis; DI = Discrimination Index; FL = Factorial Loading; ^a = negative items.

Discrimination indexes were generally lower in the mixed versions, and higher in the negative versions, than in the positive versions. Item 5 of the Grit-S scale demonstrated no discriminatory power (.00).

The ANOVA for the four versions of the Grit-S scale showed no significant differences between the total scores for the original, mixed, positive, and negative versions (F = 0.972; df = 3; p = .405).

Reliability

The reliability for each version of the scale is shown in Table 2. The original and mixed versions demonstrated the worst reliability. Reliability comparisons are shown in Table 5. The negative version of the Grit-S scale had significantly better reliability than the other versions, and the positive version had better reliability than the original or mixed versions.

Table 5. Reliability Comparison of Grit Scale Versions

Note. (−) = negative; (+) = positive; (M) = mixed; (O) = original

Correlations with Other Variables

The Pearson correlations between Neuroticism and the grit scales were –.26 for the positive version, –.38 for the mixed version and –.53 for the negative version.

Discussion

Reversed items have been questioned by previous research for various reasons (Carlson et al., 2011; Chiavaroli, 2017; Essau et al., 2012; Navarro-González et al., 2016). The present study examined the effect of item reversal on a grit scale, as well as any potential consequences of reversal when relating the scale to other variables.

Looking at the dimensionality of the versions of the scale, EFA points to a unidimensional structure, similar to previous results (Areepattamannil & Khine, 2018; Gonzalez et al., 2020; Postigo et al., 2021), meaning that the hypothesis of a two-factor structure for mixed versions is refuted. However, the best fit indexes were found for the negative and positive versions, while the mixed versions (both mixed and original Grit-S scales) exhibited the worst unidimensional fit. In other words, the use of both positive and negative items promotes the multidimensionality of the scale (Essau et al., 2012; Horan et al., 2003; Woods, 2006). This is not only a problem for the scale’s internal consistency, but can have serious consequences for the theoretical framework that researchers are developing, for example, conceptualizing more factors than necessary because of the method factor that negative items may produce. Continuing with factorial structure, the items’ factor loadings did not exhibit statistically significant differences between versions. This indicates that the factorial structure did not differ due to the use of reversed items, although this structure is less clear when using mixed scales (as they had worse fit indexes).

In the Grit-S scale, the negative version demonstrated greater reliability (α = .83) than the positive version (α = .77). This can be explained by the higher correlations between the negative items than between the positive items (Solís Salazar, 2015). The positive version exhibited a higher reliability coefficient than the mixed and original versions. Finally, there were no statistically significant differences in reliability between the mixed and original versions, which was expected as both of these scales mix positive and negative items. This confirms previous findings about the reduced reliability coefficients when using mixed scales (Carlson et al., 2011).

There were no statistically significant differences between the versions with regard to the total scores. This refutes our second hypothesis, as our data did not replicate the results of previous findings (Suárez-Álvarez et al., 2018; Vigil-Colet et al., 2020). This could be seen as the grit scale being a “special case” due to its items (people tend to agree or disagree in the same way with negative and positive items when asked about their grit levels) or the length of the questionnaire, as previous research has shown these differences with questionnaires that are at least twice as long as the Grit-S scale. One might think that the scales could be used interchangeably, given that there were no mean differences between versions. We advise against this interpretation, as having the same mean does not imply that an individual would have the same score in both versions. As we mentioned previously, the quality of factorial scores worsens with mixed versions, as do the reliability coefficients, and these differences are statistically significant.

Another example of what might be masked by similar total mean scores is the change in the correlation coefficients with Neuroticism. By redirecting just half of the items, the correlation goes from –.26 to –.38 (a difference of 7.6 points in the percentage of explained variance). If all items are redirected, that produces a correlation of –.53 (the percentage of explained variance grows by 21 points). This shows that redirecting items can have a powerful effect on the relationship with other variables. We believe that the reason for this difference is the increase in the variance of the total scores produced by negative items, which affects the correlation coefficient (Amón Hortelano, 1990). This may vary depending on the psychological construct being assessed (positive items may exhibit more variance than negative items for a different variable).
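The percentages in this paragraph follow from squaring each correlation. The short sketch below reproduces that arithmetic from the three correlations reported in the Results; the function and variable names are illustrative only. Computed without intermediate rounding, the first gain is about 7.7 points, and it matches the 7.6 quoted above when each percentage is rounded to one decimal before subtracting.

```python
def explained_variance_pct(r):
    """Percentage of variance shared by two variables with correlation r."""
    return 100 * r ** 2


# Correlations with Neuroticism reported in the Results section.
correlations = {"positive": -0.26, "mixed": -0.38, "negative": -0.53}
pct = {version: explained_variance_pct(r) for version, r in correlations.items()}

gain_mixed = pct["mixed"] - pct["positive"]        # half of the items redirected
gain_negative = pct["negative"] - pct["positive"]  # all items redirected

# Same first gain, rounding each percentage to one decimal before subtracting.
gain_mixed_rounded = round(pct["mixed"], 1) - round(pct["positive"], 1)

print(f"{gain_mixed:.2f} {gain_negative:.2f} {gain_mixed_rounded:.1f}")
```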

The results of this study should be assessed in light of some limitations. First, using a cross-sectional design, with different samples responding to each scale, could have biased the results, although the groups did have similar sociodemographic characteristics. In this regard, future studies should apply longitudinal designs. Secondly, the possibility of developing a “perfect-inverted item” is unclear, given semantic, grammatical and/or expressive issues. Some reversed expressions may sound ‘weird’ to a native speaker, leading to grammatical changes that make the sentence clearer but further from being a precise reversed version of the original item. This is not only a limitation for the present study, but also another argument against the use of reversed items in scale development.

Applied researchers should avoid developing mixed scales. Note that the problems with negative items come when they are included in a scale along with positive items (i.e., mixed scales). Having an entirely negative scale—with properly constructed items—cannot be considered bad practice, as this study shows. Thus, researchers should select which form (positive or negative) they prefer considering the theoretical framework of the construct. It is also important to note that having the same mean total scores does not mean that the compared scales are equivalent, as the factorial structure, reliability, and the relationship with other variables may differ significantly.

Appendix

Table A1. Positive and Negative Items for Grit-S Scales

Note. ^a = original Grit-S item.

Table A2. Item Direction (Positive or Negative) in Each Scale Version

Note. (O) = original version; (M) = mixed version; (+) = positive version; (–) = negative version.

Footnotes

Funding statement: This investigation was supported by a predoctoral grant from the Universidad de Oviedo (PAPI–21–PF–24).

Conflicts of Interest: None.

References

Amón Hortelano, J. (1990). Estadística para psicólogos: Estadística descriptiva [Statistics for psychologists: Descriptive statistics] (12th ed.). Pirámide.

Arco-Tirado, J. L., Fernández-Martín, F. D., & Hoyle, R. H. (2018). Development and validation of a Spanish version of the Grit-S scale. Frontiers in Psychology, 9, Article 96. https://doi.org/10.3389/fpsyg.2018.00096

Areepattamannil, S., & Khine, M. S. (2018). Evaluating the psychometric properties of the original Grit Scale using Rasch analysis in an Arab adolescent sample. Journal of Psychoeducational Assessment, 36(8), 856–862. http://doi.org/10.1177/0734282917719976

Carlson, M., Wilcox, R., Chou, C.-P., Chang, M., Yang, F., Blanchard, J., Marterella, A., Kuo, A., & Clark, F. (2011). Psychometric properties of reverse-scored items on the CES-D in a sample of ethnically diverse older adults. Psychological Assessment, 23(2), 558–562. https://doi.org/10.1037/a0022484

Chiavaroli, N. (2017). Negatively-worded multiple choice questions: An avoidable threat to validity. Practical Assessment, Research and Evaluation, 22, Article 3. https://doi.org/10.7275/ca7y-mm27

Cordero, A., Pamos, A., & Seisdedos, N. (2008). Inventario de Personalidad Neo Revisado (NEO PI-R), Inventario Neo Reducido de Cinco Factores (NEO-FFI): Manual profesional [The Neo Personality Inventory Revised (NEO PI-R), the Reduced Five Factor Personality Inventory (NEO-FFI): Professional manual] (3rd ed.). TEA.

Costa, P. T., & McCrae, R. R. (1985). The NEO personality inventory manual. Psychological Assessment Resources.Google Scholar

Datu, J. A. D., King, R. B., Valdez, J. P. M., & Eala, M. S. M. (2019). Grit is associated with lower depression via meaning in life among Filipino high school students. Youth & Society, 51(6), 865–876. https://doi.org/10.1177/0044118x18760402 Google Scholar

DiStefano, C., & Motl, R. W. (2009). Personality correlates of method effects due to negatively worded items on the Rosenberg Self-Esteem scale. Personality and Individual Differences, 46(3), 309–313. https://doi.org/10.1016/j.paid.2008.10.020

Duckworth, A. (2016). Grit: The power of passion and perseverance. Scribner/Simon & Schuster.

Duckworth, A. L., & Quinn, P. D. (2009). Development and validation of the Short Grit Scale (Grit–S). Journal of Personality Assessment, 91(2), 166–174. https://doi.org/10.1080/00223890802634290

Essau, C. A., Olaya, B., Anastassiou-Hadjicharalambous, X., Pauli, G., Gilvarry, C., Bray, D., O’Callaghan, J., & Ollendick, T. H. (2012). Psychometric properties of the Strength and Difficulties Questionnaire from five European countries. International Journal of Methods in Psychiatric Research, 21(3), 232–245. https://doi.org/10.1002/mpr.1364

Feldt, L. S. (1969). A test of the hypothesis that Cronbach’s alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34(3), 363–373. https://doi.org/10.1007/BF02289364

Ferrando, P. J., & Anguiano-Carrasco, C. (2010). El análisis factorial como técnica de investigación en psicología [Factor analysis as a research technique in psychology]. Papeles del Psicólogo, 31(1), 18–33.

Ferrando, P. J., & Lorenzo-Seva, U. (2017). Assessing the quality and appropriateness of factor solutions and factor score estimates in exploratory item factor analysis. Educational and Psychological Measurement, 78(5), 762–780. https://doi.org/10.1177/0013164417719308

García-Cueto, E. (1994). Coeficiente de congruencia [Congruence coefficient]. Psicothema, 6(3), 465–468.

Calderón Garrido, C., Navarro González, D., Lorenzo Seva, U., & Ferrando Piera, P. J. (2019). Multidimensional or essentially unidimensional? A multi-faceted factor-analytic approach for assessing the dimensionality of tests and items. Psicothema, 31(4), 450–457. https://doi.org/10.7334/psicothema2019.153

Gonzalez, O., Canning, J. R., Smyth, H., & MacKinnon, D. P. (2020). A psychometric evaluation of the Short Grit Scale. European Journal of Psychological Assessment, 36(4), 646–657. https://doi.org/10.1027/1015-5759/a000535

Haladyna, T. M., & Rodríguez, M. C. (2013). Developing and validating test items. Taylor & Francis. https://doi.org/10.4324/9780203850381

Horan, P. M., DiStefano, C., & Motl, R. W. (2003). Wording effects in self-esteem scales: Methodological artifact or response style? Structural Equation Modeling: A Multidisciplinary Journal, 10(3), 435–455. https://doi.org/10.1207/S15328007SEM1003_6

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55. https://doi.org/10.1080/10705519909540118

Irwing, P., Booth, T., & Hughes, D. J. (2018). The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test development. John Wiley & Sons. https://doi.org/10.1002/9781118489772

Józsa, K., & Morgan, G. A. (2017). Reversed items in Likert scales: Filtering out invalid responders. Journal of Psychological and Educational Research, 25(1), 7–25.

Kaiser, H. F., & Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measurement, 34(1), 111–117. https://doi.org/10.1177/001316447403400115

Lane, S., Raymond, M. R., & Haladyna, T. M. (2016). Handbook of test development (2nd ed.). Routledge. https://doi.org/10.4324/9780203102961

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 55.

Lloret-Segura, S., Ferreres-Traver, A., Hernández-Baeza, A., & Tomás-Marco, I. (2014). El análisis factorial exploratorio de los ítems: Una guía práctica, revisada y actualizada [Exploratory Item Factor Analysis: A practical guide revised and updated]. Anales de Psicología, 30(3). https://doi.org/10.6018/analesps.30.3.199361

Lorenzo-Seva, U., & Ferrando, P. J. (2013). FACTOR 9.2. Applied Psychological Measurement, 37(6), 497–498. https://doi.org/10.1177/0146621613487794

Marsh, H. W. (1986). Negative item bias in ratings scales for preadolescent children: A cognitive-developmental phenomenon. Developmental Psychology, 22(1), 37–49. https://doi.org/10.1037/0012-1649.22.1.37

Morell, M., Yang, J. S., Gladstone, J. R., Turci Faust, L., Ponnock, A. R., Lim, H. J., & Wigfield, A. (2021). Grit: The long and short of it. Journal of Educational Psychology, 113(5), 1038–1058. https://doi.org/10.1037/edu0000594

Moreno, R., Martínez, R. J., & Muñiz, J. (2015). Guidelines based on validity criteria for the development of multiple choice items. Psicothema, 27(4), 388–394. https://doi.org/10.7334/psicothema2015.110

Muñiz, J., & Fonseca-Pedrero, E. (2019). Diez pasos para la construcción de un test [Ten steps for test development]. Psicothema, 31(1), 7–16. https://doi.org/10.7334/psicothema2018.291

Musumari, P. M., Tangmunkongvorakul, A., Srithanaviboonchai, K., Techasrivichien, T., Suguimoto, S. P., Ono-Kihara, M., & Kihara, M. (2018). Grit is associated with lower level of depression and anxiety among university students in Chiang Mai, Thailand: A cross-sectional study. PLOS ONE, 13(12), Article e0209121. https://doi.org/10.1371/journal.pone.0209121

Navarro-González, D., Lorenzo-Seva, U., & Vigil-Colet, A. (2016). Efectos de los sesgos de respuesta en la estructura factorial de los autoinformes de personalidad [How response bias affects the factorial structure of personality self-reports]. Psicothema, 28(4), 465–470. https://doi.org/10.7334/psicothema2016.113

Ley Orgánica 3/2018, de 5 de diciembre, de Protección de Datos Personales y garantía de los derechos digitales [Organic Law 3/2018, of December 5, on the protection of personal data and guarantee of digital rights] (2018, December 6). Boletín Oficial del Estado, 294, Sec. I, pp. 119788–119857. https://www.boe.es/eli/es/lo/2018/12/05/3/dof/spa/pdf

Paulhus, D. L., & Vazire, S. (2005). The self-report method. In Robins, R. W., Fraley, R. C., & Krueger, R. F. (Eds.), Handbook of research methods in personality psychology (pp. 224–239). Guilford Press.

Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. Annual Review of Psychology, 63(1), 539–569. https://doi.org/10.1146/annurev-psych-120710-100452

Postigo, Á., Cuesta, M., García-Cueto, E., Menéndez-Aller, Á., González-Nuevo, C., & Muñiz, J. (2021). Grit assessment: Is one dimension enough? Journal of Personality Assessment, 103(6), 786–796. https://doi.org/10.1080/00223891.2020.1848853

Solís Salazar, M. (2015). The dilemma of combining positive and negative items in scales. Psicothema, 27(2), 192–199. https://doi.org/10.7334/psicothema2014.266

Suárez-Álvarez, J., Pedrosa, I., Lozano, L. M., García-Cueto, E., Cuesta, M., & Muñiz, J. (2018). Using reversed items in Likert scales: A questionable practice. Psicothema, 30(2), 149–158. https://doi.org/10.7334/psicothema2018.33

Timmerman, M. E., & Lorenzo-Seva, U. (2011). Dimensionality assessment of ordered polytomous items with parallel analysis. Psychological Methods, 16(2), 209–220. https://doi.org/10.1037/a0023353

van Sonderen, E., Sanderman, R., & Coyne, J. C. (2013). Ineffectiveness of reverse wording of questionnaire items: Let’s learn from cows in the rain. PLOS ONE, 8(7), Article e68967. https://doi.org/10.1371/journal.pone.0068967

Vazsonyi, A. T., Ksinan, A. J., Ksinan Jiskrova, G., Mikuška, J., Javakhishvili, M., & Cui, G. (2019). To grit or not to grit, that is the question! Journal of Research in Personality, 78, 215–226. https://doi.org/10.1016/j.jrp.2018.12.006

Vigil-Colet, A., Navarro-González, D., & Morales-Vives, F. (2020). To reverse or to not reverse Likert-type items: That is the question. Psicothema, 32(1), 108–114. https://doi.org/10.7334/psicothema2019.286

Wong, N., Rindfleisch, A., & Burroughs, J. E. (2003). Do reverse-worded items confound measures in cross-cultural consumer research? The case of the material values scale. Journal of Consumer Research, 30(1), 72–91. https://doi.org/10.1086/374697

Woods, C. M. (2006). Careless responding to reverse-worded items: Implications for confirmatory factor analysis. Journal of Psychopathology and Behavioral Assessment, 28(3), 186–191. https://doi.org/10.1007/s10862-005-9004-7