open access publication

Article, 2024

A binarization approach to model interactions between categorical predictors in Generalized Linear Models

Applied Intelligence, ISSN 0924-669X, 10.1007/s10489-024-05576-x

Contributors

Carrizosa E. 0000-0002-0832-8700 [1]; Galvis Restrepo M. 0000-0001-5639-0018 (Corresponding author); Romero Morales D. 0000-0001-7945-1469 [2]

Affiliations

  1. [1] Universidad de Sevilla
     [NORA names: Spain; Europe, EU; OECD]
  2. [2] Copenhagen Business School
     [NORA names: CBS Copenhagen Business School; University; Denmark; Europe, EU; Nordic; OECD]

Abstract

In this paper, our goal is to enhance the interpretability of Generalized Linear Models by identifying the most relevant interactions between categorical predictors. Searching for interaction effects can quickly become a highly combinatorial, and thus computationally costly, problem when we have many categorical predictors or even a few of them but with many categories. Moreover, the estimation of coefficients requires large training samples with enough observations for each interaction between categories. To address these bottlenecks, we propose to find a reduced representation for each categorical predictor as a binary predictor, where categories are clustered based on a dissimilarity. We provide a collection of binarized representations for each categorical predictor, where the dissimilarity takes into account information from the main effects and the interactions. The choice of the binarized predictors representing the categorical predictors is made with a novel heuristic procedure that is guided by the accuracy of the so-called binarized model. We test our methodology on both real-world and simulated data, illustrating that, without damaging the out-of-sample accuracy, our approach trains sparse models including only the most relevant interactions between categorical predictors.
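The idea of replacing a categorical predictor with a binary one by clustering its categories can be illustrated with a minimal sketch. This is not the authors' procedure: the paper's dissimilarity draws on both main effects and interactions, and the choice among binarizations is made by a heuristic guided by model accuracy. Here, as a simplifying assumption, the dissimilarity is just the difference between per-category mean responses, and the function names `category_means` and `binarize` are hypothetical.

```python
from collections import defaultdict


def category_means(categories, responses):
    """Mean response per category (assumed stand-in for a main effect)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, responses):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def binarize(categories, responses):
    """Map each category to 0 or 1 by splitting the categories into two clusters.

    Categories are sorted by mean response, and the cut point that minimizes
    the within-cluster sum of squared deviations of those means is chosen.
    """
    means = category_means(categories, responses)
    ordered = sorted(means, key=means.get)
    if len(ordered) < 2:
        raise ValueError("need at least two categories to binarize")

    def sse(group):
        vals = [means[c] for c in group]
        centre = sum(vals) / len(vals)
        return sum((v - centre) ** 2 for v in vals)

    # Try every cut of the sorted list into a low group and a high group.
    best_cut = min(
        range(1, len(ordered)),
        key=lambda k: sse(ordered[:k]) + sse(ordered[k:]),
    )
    high = set(ordered[best_cut:])
    return {c: int(c in high) for c in ordered}


# Toy data (invented for illustration only).
cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
ys = [0.1, 0.2, 0.9, 1.0, 0.3, 0.2, 1.1, 0.8]

mapping = binarize(cats, ys)
binary_predictor = [mapping[c] for c in cats]
```

With two predictors binarized this way, their interaction needs a single coefficient, rather than one per pair of categories. This is the source of the sparsity the abstract describes.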

Keywords

Categorical predictors, Clustering of categories, Generalized linear models, Interactions, Interpretability

Funders

  • Ministry of Science, Innovation and Universities
  • Junta de Andalucía
  • EC H2020 MSCA

Data Provider: Elsevier