
References

Published online by Cambridge University Press: 22 February 2024

Sandip Sinharay, Educational Testing Service, New Jersey
Richard A. Feinberg, National Board of Medical Examiners, Pennsylvania

Type: Chapter
Information: Subscores: A Practical Guide to Their Production and Consumption, pp. 158–168
Publisher: Cambridge University Press
Print publication year: 2024


References

Ackerman, T., & Shu, Z. (2009). Using confirmatory MIRT modeling to provide diagnostic information in large scale assessment. Paper presented at the meeting of the National Council on Measurement in Education, San Diego, CA.
ACT. (2022). ACT technical manual. Iowa City, IA: ACT.
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47–76. https://doi.org/10.3102/10769986022001047
Albanese, M. A. (2014). The testing column: Differences in subject area subscores on the MBE and other illusions. The Bar Examiner, 83(2), 26–31.
Almond, R., Steinberg, L., & Mislevy, R. (2002). Enhancing the design and delivery of assessment systems: A four-process architecture. The Journal of Technology, Learning and Assessment, 1(5). https://ejournals.bc.edu/index.php/jtla/article/view/1671
American Board of Internal Medicine Maintenance of Certification (ABIM MOC). (2023). Enhanced score report. www.abim.org/Media/f4pp1das/score-report.pdf
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In Thorndike, R. L. (Ed.), Educational measurement (pp. 508–600). Washington, DC: American Council on Education.
Armed Services Vocational Aptitude Battery (ASVAB). (2023). Understanding your ASVAB results. www.asvabprogram.com/media-center-article/28
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204. https://doi.org/10.2307/1165169
Bell, R., & Lumsden, J. (1980). Test length and validity. Applied Psychological Measurement, 4(2), 165–170. https://doi.org/10.1177/014662168000400203
Bertin, J. (1983). Semiology of graphics: Diagrams, networks, maps (Translated into English by Berg, W. J.). Madison: University of Wisconsin Press.
Biancarosa, G., Kennedy, P. C., Carlson, S. E., Yoon, H., Seipel, B., Liu, B., & Davison, M. L. (2019). Constructing subscores that add validity: A case study of identifying students at risk. Educational and Psychological Measurement, 79(1), 65–84. https://doi.org/10.1177/0013164418763255
Brennan, R. L. (2012). Utility indexes for decisions about subscores (CASMA Research Report No. 33). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.
Brinton, W. C. (1939). Graphic presentations. New York: Brinton.
Brown, G. T. L., O’Leary, T. M., & Hattie, J. A. C. (2019). Effective reporting for formative assessment: The asTTle case example. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 107–125). New York: Routledge. https://doi.org/10.4324/9781351136501-11
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
Bulut, O., Davison, M. L., & Rodriguez, M. C. (2017). Estimating between-person and within-person subscore reliability with profile analysis. Multivariate Behavioral Research, 52(1), 86–104. https://doi.org/10.1080/00273171.2016.1253452
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87. https://doi.org/10.2307/2682801
Choi, I., & Papageorgiou, S. (2020). Evaluating subscore uses across multiple levels: A case of reading and listening subscores for young EFL learners. Language Testing, 37(2), 254–279. https://doi.org/10.1177/0265532219879654
Comprehensive Clinical Science Examination (CCSE). (2023). Examinee performance report. www.nbme.org/sites/default/files/2022-12/CCSE_Examinee_Performance_Report_2022.pdf
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/bf02310555
Cronbach, L. J., Schönemann, P., & McKie, D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312. https://doi.org/10.1177/001316446502500201
CTB/McGraw-Hill. (2001). TerraNova, the second edition: Individual profile report. Monterey, CA: Author.
Dai, S., Svetina, D., & Wang, X. (2017). Reporting subscores using R: A software review. Journal of Educational and Behavioral Statistics, 42, 617–638. https://doi.org/10.3102/1076998617716462
Dai, S., Wang, X., & Svetina, D. (2019). Subscore: Sub-score computing functions in classical test theory (R package version 3.1) [Computer software]. http://CRAN.R-project.org/package=subscore
Davison, M. L., Davenport, E. C., Chang, Y.-F., Vue, K., & Su, S. (2015). Criterion-related validity: Assessing the value of subscores. Journal of Educational Measurement, 52, 263–279. https://doi.org/10.2307/43940571
DiBello, L. V., Roussos, L., & Stout, W. F. (2006). Review of cognitive diagnostic assessment and a summary of psychometric models. In Rao, C. R., & Sinharay, S. (Eds.), Handbook of statistics, Volume 26 (pp. 979–1030). Amsterdam: Elsevier Science B.V. https://doi.org/10.1016/s0169-7161(06)26031-0
Dorans, N. J., & Walker, M. E. (2007). Sizing up linkages. In Dorans, N. J., Pommerich, M., & Holland, P. W. (Eds.), Linking and aligning scores and scales (pp. 179–198). New York: Springer. https://doi.org/10.1007/978-0-387-49771-6_10
Draper, N. R., & Smith, H. (1998). Applied regression analysis. New York: Wiley. https://doi.org/10.1002/9781118625590
DuBois, P. H. (1970). A history of psychological testing. Boston: Allyn & Bacon.
Duolingo English Test. (2023). Sample certificate. https://englishtest.duolingo.com/sample_certificate
Dwyer, A., Boughton, K. A., Yao, L., Steffen, M., & Lewis, D. (2006, April). A comparison of subscale score augmentation methods using empirical data. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.
Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22, 15–25. https://doi.org/10.1177/001316446202200103
Educational Testing Service. (2008). Praxis™ 2008–09 information bulletin. Princeton, NJ: Educational Testing Service.
Educational Testing Service. (2020). TOEFL® Research insight series, Volume 3: Reliability and comparability of TOEFL iBT® scores. Princeton, NJ: Author.
Educational Testing Service. (2021). The Praxis study companion, elementary education: Content knowledge. Princeton, NJ: Educational Testing Service.
Edwards, M. C., & Vevea, J. L. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31, 241–259. https://doi.org/10.3102/10769986031003241
Everitt, B. (2011). Cluster analysis. Chichester, UK: Wiley.
Every Student Succeeds Act, 20 U.S.C. § 6301 (2015). www.congress.gov/bill/114th-congress/senate-bill/1177
Feinberg, R. A., & Clauser, A. L. (2016). Can item keyword feedback help remediate knowledge gaps? Journal of Graduate Medical Education, 8(4), 541–545. https://doi.org/10.4300/jgme-d-15-00463.1
Feinberg, R. A., & Jurich, D. P. (2017). Guidelines for interpreting and reporting subscores. Educational Measurement: Issues and Practice, 36(1), 5–13. https://doi.org/10.1111/emip.12142
Feinberg, R. A., & von Davier, M. (2020). Conditional subscore reporting using the compound binomial distribution. Journal of Educational and Behavioral Statistics, 45(5), 515–533. https://doi.org/10.3102/1076998620911933
Feinberg, R. A., & Wainer, H. (2011). Extracting sunbeams from cucumbers. Journal of Computational and Graphical Statistics, 20(4), 793–810. https://doi.org/10.1198/jcgs.2011.204a
Feinberg, R. A., & Wainer, H. (2014). When can we improve subscores by making them shorter? The case against subscores with overlapping items. Educational Measurement: Issues and Practice, 33(3), 47–54. https://doi.org/10.1111/emip.12037
Feinberg, R. A., & Wainer, H. (2014). A simple equation to predict a subscore’s value. Educational Measurement: Issues and Practice, 33(3), 55–56. https://doi.org/10.1111/emip.12035
Flanagan, J. C. (1948). The aviation psychology program in the Army Air Forces (Report 1, AAF Aviation Psychology Program Research Reports). US Government Printing Office, pp. xii+316.
Fleiss, J. L. (1975). Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31, 651–659. https://doi.org/10.2307/2529549
Friendly, M., & Wainer, H. (2021). A history of data visualization and graphic communication. Cambridge, MA: Harvard University Press. https://doi.org/10.4159/9780674259034
George, A. C., Robitzsch, A., Kiefer, T., Gross, J., & Uenlue, A. (2016). The R package CDM for cognitive diagnosis models. Journal of Statistical Software, 74(2), 1–24. https://doi.org/10.18637/jss.v074.i02
Goodman, D. P., & Hambleton, R. K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17, 145–220. https://doi.org/10.1207/s15324818ame1702_3
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Part I. Journal of the American Statistical Association, 49, 732–764. https://doi.org/10.2307/2281536
Haberman, S. J. (2008a). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229. https://doi.org/10.3102/1076998607302636
Haberman, S. J. (2008b). Subscores and validity. ETS Research Report Series (ETS Research Report No. RR-08-64). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2008.tb02150.x
Haberman, S. J. (2008c). Outliers in assessments. ETS Research Report Series (ETS Research Report No. RR-08-41). https://doi.org/10.1002/j.2333-8504.2008.tb02150.x
Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm. ETS Research Report Series (ETS Research Report No. RR-13-32). https://doi.org/10.1002/j.2333-8504.2013.tb02339.x
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227. https://doi.org/10.1007/s11336-010-9158-4
Haberman, S. J., & Sinharay, S. (2013). Does subgroup membership information lead to better estimation of true subscores? British Journal of Mathematical and Statistical Psychology, 66, 451–469. https://doi.org/10.1111/j.2044-8317.2012.02061.x
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95. https://doi.org/10.1348/000711007x248875
Haberman, S. J., & von Davier, M. (2007). Some notes on models for cognitively based skills diagnosis. In Rao, C. R., & Sinharay, S. (Eds.), Handbook of statistics, Vol. 26 (pp. 1031–1038). Amsterdam: Elsevier North-Holland. https://doi.org/10.1016/s0169-7161(06)26040-1
Haberman, S., & Yao, L. (2015). Repeater analysis for combining information from different assessments. Journal of Educational Measurement, 52, 223–251. https://doi.org/10.1111/jedm.12075
Haberman, S. J., Yao, L., & Sinharay, S. (2015). Prediction of true test scores from observed item scores and ancillary data. British Journal of Mathematical and Statistical Psychology, 68, 363–385. https://doi.org/10.1111/bmsp.12052
Haladyna, T. M., & Kramer, G. A. (2004). The validity of subscores for a credentialing test. Evaluation and the Health Professions, 27(4), 349–368. https://doi.org/10.1177/0163278704270010
Hambleton, R. K., & Zenisky, A. L. (2013). Reporting test scores in more meaningful ways: A research-based approach to score report design. In Geisinger, K. F. (Ed.), APA handbook of testing and assessment in psychology: Vol. 3. Testing and assessment in school psychology and education (pp. 479–494). Washington, DC: American Psychological Association. https://doi.org/10.1037/14049-023
Harris, D. J., & Hanson, B. A. (1991, April). Methods of examining the usefulness of subscores. Paper presented at the meeting of the National Council on Measurement in Education, Chicago, IL.
Hegarty, M. (2019). Advances in cognitive science and information visualization. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 19–34). New York: Routledge. https://doi.org/10.4324/9781351136501-4
Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 19–60). Cambridge: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.002
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272. https://doi.org/10.1177/01466210122032064
Kelley, T. L. (1923). Statistical method. New York: Macmillan.
Kibby, M. W. (1981). Test review: The degrees of reading power. Journal of Reading, 24(5), 416–427. www.jstor.org/stable/40032381
Kolstad, A., Cohen, J., Baldi, S., Chan, T., DeFur, E., & Angeles, J. (1998). The response probability convention used in reporting data from IRT assessment scales: Should NCES adopt a standard? Washington, DC: American Institutes for Research.
LaFlair, G. T. (2020). Duolingo English test: Subscores (Duolingo Research Report No. DRR-20-03). Duolingo.
Lane, S., Raymond, M. R., Haladyna, T. M., & Downing, S. M. (2015). Test development process. In Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.), Handbook of test development (2nd ed., pp. 3–18). New York, NY: Routledge.
Lazer, S., Mazzeo, J., & Weiss, A., with Campbell, J., Casalaina, L., Horkay, N., Kaplan, B., & Rogers, A. (2001). Final report on enhanced achievement level reporting and scale anchoring activities. Unpublished report prepared on behalf of the National Assessment Governing Board.
Leighton, J. P., & Gierl, M. J. (2007). Cognitive diagnostic assessment for education: Theory and applications. New York: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237. https://doi.org/10.1111/j.1745-3984.2004.tb01163.x
Lim, E., & Lee, W. (2020). Subscore equating and profile reporting. Applied Measurement in Education, 33, 95–112. https://doi.org/10.1080/08957347.2020.1732381
Ling, G. (2012). Why the major field test in business does not report subscores – Reliability and construct validity evidence (ETS Research Report No. RR-08-64). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2012.tb02293.x
Liu, Y., Robin, F., Yoo, H., & Manna, V. (2018). Statistical properties of the GRE® psychology test subscores. ETS Research Report Series. https://doi.org/10.1002/ets2.12206
Longabach, T., & Peyton, V. A. (2018). Comparison of reliability and precision of subscore reporting methods for a state English language proficiency assessment. Language Testing, 35, 297–317. https://doi.org/10.1177/0265532217689949
Longford, N. T. (1990). Multivariate variance component analysis: An application in test development. Journal of Educational Statistics, 15, 91–112. https://doi.org/10.2307/1164764
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lord, F. M., & Wingersky, M. (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8, 453–461. https://doi.org/10.1177/014662168400800409
Lovett, B. J., & Harrison, A. G. (2021). De-implementing inappropriate accommodations practices. Canadian Journal of School Psychology, 36(2), 115–126. https://doi.org/10.1177/0829573520972556
Luecht, R. (2007). Using information from multiple-choice distractors to enhance cognitive-diagnostic score reporting. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 319–340). Cambridge: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.011
Luecht, R. (2013). Assessment engineering task model maps: Task models and templates as a new way to develop and implement test specifications. Journal of Applied Testing Technology, 14, 1–38.
Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006, April). Scalability and the development of useful diagnostic scales. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research, and Evaluation, 14, 1–10.
Margolis, M. J., Clauser, B. E., Winward, M., & Dillon, G. F. (2010). Validity evidence for USMLE examination cut scores: Results of a large-scale survey. Academic Medicine, 85(10), 93–97. https://doi.org/10.1097/acm.0b013e3181ed4028
McDermott, P. A., Glutting, J. J., Jones, J. N., Watkins, M. W., & Kush, J. (1989). Core profile types in the WISC-R national sample: Structure, membership, and applications. Psychological Assessment: A Journal of Consulting and Clinical Psychology, 1, 292–299. https://doi.org/10.1037/1040-3590.1.4.292
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (Ed.), Frontiers in econometrics (pp. 105–142). New York: Academic Press.
Meijer, R. R., Boevé, A. J., Tendeiro, J. N., Bosker, R. J., & Albers, C. J. (2017). The use of subscores in higher education: When is this useful? Frontiers in Psychology, 8, 1–6. https://doi.org/10.3389/fpsyg.2017.00305
Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician, 54(1), 17–24. https://doi.org/10.2307/2685605
Mertler, C. A. (2018). Norm-referenced interpretation. In Frey, B. (Ed.), The SAGE encyclopedia of educational research, measurement, and evaluation (pp. 1161–1163). Thousand Oaks, CA: SAGE. https://doi.org/10.4135/9781506326139.n478
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–67. https://doi.org/10.1207/S15366359MEA0101_02
Morey, L. C. (2004). The Personality Assessment Inventory (PAI). In Maruish, M. E. (Ed.), The use of psychological testing for treatment planning and outcomes assessment: Instruments for adults (pp. 509–551). Mahwah, NJ: Lawrence Erlbaum Associates Publishers. https://doi.org/10.4324/9781410610614
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. https://doi.org/10.1177/014662169201600206
National Assessment of Educational Progress (NAEP). (2023). Student groups. https://nces.ed.gov/nationsreportcard/guides/groups.aspx
New York State Testing Program (NYSTP). (2023). NYS grades 3-8 2021 technical report. www.nysed.gov/common/nysed/files/programs/state-assessment/3-8-technical-report-2021w.pdf
Paolino, J. (2020). Teaching linear correlation using contour plots. Teaching Statistics, 43(1), 13–20. https://doi.org/10.1111/test.12239
Papageorgiou, S., & Choi, I. (2018). Adding value to second-language listening and reading subscores: Using a score augmentation approach. International Journal of Testing, 18, 207–230. https://doi.org/10.1080/15305058.2017.1407766
Pashler, H., Cepeda, N. J., Wixted, J. T., & Rohrer, D. (2005). When does feedback facilitate learning of words? Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(1), 3–8. https://doi.org/10.1037/0278-7393.31.1.3
Pearson Longman. (2010). The official guide to PTE: Pearson test of English academic. Hong Kong SAR: Pearson Longman Asia ELT.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5–13. https://doi.org/10.1080/01619561003685304
Personality Assessment Inventory (PAI). (2023). The PAI police and public safety selection report. https://post.ca.gov/portals/0/post_docs/publications/psychological-screening-manual/PAI_PolicePubSftyRpt.pdf
Pieper Bar Review. (2017). Bar examiners to provide (slightly) more information to candidates who fail the bar exam. http://news.pieperbar.com/bar-examiners-to-provide-slightly-more-information-to-candidates-who-fail-the-bar-exam
Praxis. (2023). Interpreting your Praxis® test taker score report. www.ets.org/s/praxis/pdf/sample_score_report.pdf
Puhan, G., & Liang, L. (2011). Equating subscores under the nonequivalent anchor test (NEAT) design. Educational Measurement: Issues and Practice, 30(1), 23–35. https://doi.org/10.1111/j.1745-3992.2010.00197.x
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285. https://doi.org/10.1080/08957347.2010.486287
R Core Team. (2022). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. www.R-project.org/
Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38(4, Pt. 1), 513–532. https://doi.org/10.1007/bf02291492
Rasch, G. (1966). An individualistic approach to item analysis. In Lazarsfeld, P. F., & Henry, N. W. (Eds.), Readings in mathematical social science (pp. 89–107). Cambridge, MA: MIT Press.
Raymond, M. R. (2001). Job analysis and the specification of content for licensure and certification examinations. Applied Measurement in Education, 14, 369–415. https://doi.org/10.1207/s15324818ame1404_4
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer. https://doi.org/10.1007/978-0-387-89976-3
Reckase, M. D., & Xu, J. R. (2014). The evidence for a subscore structure in a test of English language competency for English language learners. Educational and Psychological Measurement, 75, 805–825. https://doi.org/10.1177/0013164414554416
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic assessments. Educational Measurement: Issues and Practice, 29(3), 25–38. https://doi.org/10.1111/j.1745-3992.2010.00181.x
Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. H. (2007). The fusion model skills diagnostic system. In Leighton, J., & Gierl, M. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 275–318). New York: Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.010
Rupp, A. A., & Templin, J. L. (2009). The (un)usual suspects? A measurement community in search of its identity. Measurement, 7(2), 115–121. https://doi.org/10.1080/15366360903187700
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.
Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association. https://doi.org/10.1037/10244-000
Sawaki, Y., & Sinharay, S. (2018). Do the TOEFL iBT® section scores provide value-added information to stakeholders? Language Testing, 35, 529–556. https://doi.org/10.1177/0265532217716731
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174. https://doi.org/10.1111/j.1745-3984.2010.00106.x
Sinharay, S. (2013). A note on assessing the added value of subscores. Educational Measurement: Issues and Practice, 32, 38–42. https://doi.org/10.1111/emip.12021
Sinharay, S. (2014). Analysis of added value of subscores with respect to classification. Journal of Educational Measurement, 51, 212–222. https://doi.org/10.1111/jedm.12043
Sinharay, S., & Haberman, S. J. (2008). How much can we reliably know about what students know? Measurement: Interdisciplinary Research and Perspectives, 6, 46–49. https://doi.org/10.1080/15366360802715486
Sinharay, S., & Haberman, S. J. (2011). Equating of augmented subscores. Journal of Educational Measurement, 48, 122–145. https://doi.org/10.1111/j.1745-3984.2011.00137.x
Sinharay, S., & Haberman, S. J. (2014). An empirical investigation of population invariance in the value of subscores. International Journal of Testing, 14, 22–48. https://doi.org/10.1080/15305058.2013.822712
Sinharay, S., Haberman, S. J., & Lee, Y.-H. (2011). When does scale anchoring work? A case study. Journal of Educational Measurement, 48(1), 61–80. https://doi.org/10.1111/j.1745-3984.2011.00131.x
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28. https://doi.org/10.1111/j.1745-3992.2007.00105.x
Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71, 789–797. https://doi.org/10.1177/0013164410391782
Sinharay, S., Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic subscores in educational testing: Temptations, pitfalls, and some solutions. Multivariate Behavioral Research, 45, 553–573. https://doi.org/10.1080/00273171.2010.483382
Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40. https://doi.org/10.1111/j.1745-3992.2011.00208.x
Sinharay, S., Puhan, G., Haberman, S. J., & Hambleton, R. K. (2019). Subscores: When to communicate them, what are their alternatives, and some recommendations. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 35–49). New York: Routledge. https://doi.org/10.4324/9781351136501-5
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357–375. https://doi.org/10.1177/0013164409355694
Slater, S., Livingston, S. L., & Silver, M. (2019). Score reports for large-scale testing programs. In Zapata-Rivera, D. (Ed.), Score reporting research and applications (The NCME Applications of Educational Measurement and Assessment Book Series) (pp. 91–106). New York, NY: Routledge. https://doi.org/10.4324/9781351136501-10
South Carolina College- and Career-Ready Assessments (SC READY). (2023). Individual student report. https://ed.sc.gov/tests/tests-files/sc-ready-files/spring-2022-sample-individual-student-report-english/
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101. https://doi.org/10.2307/1412159
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x
Spencer, B. D. (Ed.). (1997). Statistics and public policy. Oxford: Clarendon Press.
Stanton, H. C., & Reynolds, C. R. (2000). Configural frequency analysis as a method of determining Wechsler profile types. School Psychology Quarterly, 15(4), 434–448. https://doi.org/10.1037/h0088799
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63–86. https://doi.org/10.1080/08957340903423651
Swanson, D. B., Case, S. M., & Nungester, R. J. (1991). Validity of NBME Part I and Part II scores in prediction of Part III performance. Academic Medicine, 66, S7–S9. https://doi.org/10.1097/00001888-199109001-00004
Tanaka, V. (2023). A framework for reporting technically-sound and useful subscores on state assessments. www.nciea.org/blog/promoting-effective-practices-for-subscore-reporting-and-use/
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354. https://doi.org/10.1111/j.1745-3984.1983.tb00212.x
Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76, 103–154. https://doi.org/10.1086/224909
Thissen, D. (2013). Using the testlet response model as a shortcut to multidimensional item response theory subscore computation. In Millsap, R., van der Ark, L., Bolt, D., & Woods, C. (Eds.), New developments in quantitative psychology: Presentations from the 77th Annual Psychometric Society Meeting (pp. 29–40). New York: Springer. https://doi.org/10.1007/978-1-4614-9348-8_3
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
United States Medical Licensing Examination (USMLE). (2023). Updated sample Step 2 CK annual school report. www.nbme.org/sites/default/files/2022-08/2022_Enhanced_USMLE_Step_2_CK_School_Report_Sample.pdf
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307. https://doi.org/10.1348/000711007x193957
Wainer, H. (1984). How to display data badly. The American Statistician, 38(2), 137–147. https://doi.org/10.2307/2683253
Wainer, H. (1997). Visual revelations. New York: Copernicus Press. https://doi.org/10.4324/9780203774793
Wainer, H. (2009). Picturing the uncertain world. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400832897
Wainer, H. (2015). On the crucial role of empathy in the design of communications: Genetic testing as an example. In Truth or truthiness: Distinguishing fact from fiction by learning to think like a data scientist (pp. 82–90). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781316424315.012
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410605931
Wainer, H., & Feinberg, R. A. (2017). For want of a nail: Why unnecessarily long tests may be impeding the progress of Western civilization. In Pitici, M. (Ed.), The best writing on mathematics 2016 (pp. 321–330). Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400885602-030
Wainer, H., Gessaroli, M., & Verdi, M. (2006). Finding what is not there through the unfortunate binning of results: The Mendel Effect. Chance, 19(1), 49–52. https://doi.org/10.1080/09332480.2006.10722771
Wainer, H., & Robinson, D. (2023). Why testing? Why should it cost you? Chance, 36(1), 48–52. https://doi.org/10.1080/09332480.2023.2179281
Wainer, H., Sheehan, K. M., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37, 113–140. https://doi.org/10.1111/j.1745-3984.2000.tb01079.x
Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented scores: “Borrowing strength” to compute scores based on small numbers of items. In Thissen, D., & Wainer, H. (Eds.), Test scoring (pp. 343–387). Mahwah, NJ: Erlbaum Associates. https://doi.org/10.4324/9781410604729-16
Wang, X., Svetina, D., & Dai, S. (2019). Exploration of factors affecting the added value of test subscores. Journal of Experimental Education, 87, 179–192. https://doi.org/10.1080/00220973.2017.1409182
Wilson, K. M. (2000). An exploratory dimensionality assessment of the TOEIC test. ETS Research Report Series (ETS Research Report No. RR-00-14). https://doi.org/10.1002/j.2333-8504.2000.tb01837.x
Yao, L., Sinharay, S., & Haberman, S. J. (2014). Documentation for the software package SQE (ETS Research Memorandum No. RM-14-02). Educational Testing Service.
Yen, W. M. (1987). A Bayesian/IRT index of objective performance. Paper presented at the meeting of the Psychometric Society, Montreal, Canada.
Zapata-Rivera, D., VanWinkle, W., & Zwick, R. (2012). Applying score design principles in the design of score reports for CBAL™ Teachers (ETS Research Memorandum No. RM-12-20). Princeton, NJ: Educational Testing Service.
Zenisky, A. L., & Hambleton, R. K. (2012). Developing test score reports that work: The process and best practices for effective communication. Educational Measurement: Issues and Practice, 31(2), 21–26. https://doi.org/10.1111/j.1745-3992.2012.00231.x
Zenisky, A. L., & Hambleton, R. K. (2015). A model and good practices for score reporting. In Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.), Handbook of test development (2nd ed., pp. 585–602). New York: Routledge.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping on the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15–25. https://doi.org/10.1111/j.1745-3992.2001.tb00059.x
