
On clustering levels of a hierarchical categorical risk factor

Published online by Cambridge University Press: 01 February 2024

Bavo D.C. Campo*
Affiliation:
Faculty of Economics and Business, KU Leuven, Belgium
Katrien Antonio
Affiliation:
Faculty of Economics and Business, KU Leuven, Belgium; Faculty of Economics and Business, University of Amsterdam, Amsterdam, The Netherlands; LRisk, Leuven Research Center on Insurance and Financial Risk Analysis, KU Leuven, Belgium; LStat, Leuven Statistics Research Center, KU Leuven, Belgium
*
Corresponding author: Bavo D.C. Campo; Email: bavo.campo@kuleuven.be

Abstract

Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers’ compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies.
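The core idea of the abstract — characterize each category at a given hierarchy level by engineered risk features and then cluster similar categories — can be sketched in a few lines. This is a minimal illustration, not the authors' PHiRAT implementation: the industry names, feature values, and the plain k-means routine are toy assumptions standing in for the paper's observed damage rates, claim frequencies, and embedding features.

```python
import math

# Toy per-category risk profile at one hierarchy level:
# (damage rate, claim frequency). Values are illustrative assumptions.
profiles = {
    "agriculture":  (0.80, 0.12),
    "forestry":     (0.90, 0.14),
    "retail":       (0.20, 0.05),
    "wholesale":    (0.25, 0.06),
    "construction": (1.10, 0.20),
    "mining":       (1.00, 0.18),
}

def kmeans(points, k, iters=20):
    """Plain Lloyd-style k-means with deterministic initialization
    (first k points as centers), so the toy result is reproducible."""
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each category profile to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep empty cluster's center
        centers = new_centers
    return [min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points]

names = list(profiles)
labels = kmeans([profiles[n] for n in names], k=3)

# Collect categories sharing a cluster label into one reduced category.
grouped = {}
for name, lab in zip(names, labels):
    grouped.setdefault(lab, []).append(name)
print(grouped)  # similar industries land in the same reduced category
```

On this toy data the low-risk trade categories (retail, wholesale) are merged, as are the two highest-risk ones (construction, mining), mimicking how the paper's method collapses a large nominal factor into a smaller, generalizable grouping; the actual study additionally feeds text-embedding features of each industry's activity description into the clustering step.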

Type
Original Research Paper
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries

