Impact Statement
While causal feature selection helps design more robust ML models, its joint application to multiple datasets remains limited because standard causal discovery algorithms output a different set of drivers for each dataset, which is impractical. To mitigate this issue, we apply a newly developed “multidata” causal feature selection approach, which identifies a single set of optimal causal drivers from an ensemble of multivariate time series. Applied to the statistical prediction of TC intensity, our approach outperforms standard feature selection methods by helping simple regression algorithms better generalize to unseen cases. In addition to making our models robust, causal feature selection also eliminates redundant predictors while identifying new ones, leading to lighter models and aiding scientific discovery.
1. Introduction
Machine learning (ML) combines statistical methods and numerical optimization to learn a group of tasks from data. Progress in computational capabilities, combined with the availability of large amounts of data, allows the development of ML models to predict and understand nonlinear systems such as climate processes and extreme weather events. For environmental applications, processing big data that are nonlinearly related often requires (a) dimensionality reduction; and (b) strategic selection of the model's features to make ML models cheaper to run, better at generalizing, and easier to explain (Guyon and Elisseeff, 2003; Yu et al., 2020, 2022). This article compares different methods to discover a subset of the most relevant features in environmental datasets (Guyon and Elisseeff, 2003; Post et al., 2016; Li et al., 2017) and explores the effect of causal feature selection on statistical prediction skill. For this, we work at the intersection of causal inference and ML, an active area of research (Chen et al., 2020) because causal relations help acquire robust knowledge beyond the support of observed data distributions (Schölkopf et al., 2021). Causal inference can broadly be categorized into three research directions: (a) causal representation learning; (b) causal discovery; and (c) causal reasoning (Schölkopf et al., 2021; Kaddour et al., 2022).
To select features, we here explore the use of causal discovery, a methodology for learning qualitative cause-and-effect relationships between a collection of variables from data that have not been obtained under controlled experimental conditions (Spirtes et al., 2000; Peters et al., 2017). Incorporating causal relationships in ML models via feature selection can make ML models more interpretable (Guyon and Elisseeff, 2003; Runge et al., 2015b; Yu et al., 2022; Iglesias-Suarez et al., 2023) and less susceptible to overfitting (Aliferis et al., 2010a, 2010b; Runge et al., 2015a).
There are two main challenges when applying causal discovery in environmental sciences. The first challenge is algorithmic: environmental data often consist of multiple realizations of the same process with slight differences, and causal discovery algorithms that apply to such multiple-realization problems remain underexploited (Yu et al., 2020). The second challenge is the lack of benchmarking: causal feature selection is rarely compared against other feature selection methods for ML-based predictions. Here, we address these two gaps by introducing a causal feature selection framework to estimate causal relationships from multiple time series datasets (Runge et al., 2015a, 2015b, 2019a, 2019b; Yu et al., 2019; Runge, 2020). We compare feature selection algorithms by training simple ML prediction models for each of the selected sets of features and evaluating their predictive performances. Our framework is applied to the prediction of tropical cyclone (TC) intensity to demonstrate that causal feature selection (a) improves out-of-sample skill, and (b) uncovers the best predictors in real-world situations.
2. Methodology: Causal Feature Selection for Multiple Realizations
Our implementation of causal feature selection (Geiger et al., 1990; Pena et al., 2007; Gao and Ji, 2017) uses the recently developed multidata (M) functionality for two time-series-based causal discovery algorithms, explained below. Our multidata causal discovery approach for preselecting causally relevant predictors has two components: (a) the causal discovery algorithms themselves; and (b) their application to a dataset combining data from multiple sources. From a causal perspective, the setup used in this study is simplified because only variables that are time-lagged with respect to the target are considered potential predictors. As a result, causal discovery effectively reduces to a feature selection algorithm that removes all predictors which are (conditionally) independent of the target (given the other predictors) and which hence provide no additional information for predicting the target. The multidata functionality itself, however, is more general and also applies to time series causal discovery tasks that consider contemporaneous causal relationships.
Here, we explore the use of the causal discovery algorithms PC $ {}_1 $ and PCMCI (Runge et al., 2019b) for causal feature selection. The PC $ {}_1 $ algorithm is a variant of the PC algorithm (Spirtes et al., 2000). First, PC $ {}_1 $ initializes the potential causal drivers $ pa\left({Y}_t\right) $ of each target variable $ {Y}_t $ as the set of all variables $ pa\left({Y}_t\right)=\left\{{X}_{t-\tau}^i\,|\,i=1,\dots, {N}_X,\ \tau ={\tau}_{min},\dots, {\tau}_{max}\right\} $ within the considered range $ \left[{\tau}_{min},{\tau}_{max}\right] $ of time lags, where the $ {X}^i $ with $ i=1,\dots, {N}_X $ are the potential predictors and where $ {\tau}_{min} $ and $ {\tau}_{max} $, respectively, are the minimal and maximal time lags at which direct causal influences can occur. Then, PC $ {}_1 $ iteratively removes variables from $ pa\left({Y}_t\right) $ that are irrelevant or redundant for the prediction of $ {Y}_t $. Specifically, PC $ {}_1 $ removes elements $ {X}_{t-\tau}^i $ from $ pa\left({Y}_t\right) $ that are conditionally independent of $ {Y}_t $ given subsets $ {S}_k\subseteq pa\left({Y}_t\right) $ whose cardinality $ k $ increases iteratively: for $ k=0 $, all $ {X}_{t-\tau}^i $ with $ {X}_{t-\tau}^i\perp {Y}_t $ are removed; for $ k=1 $, those with $ {X}_{t-\tau}^i\perp {Y}_t\mid {S}_1 $, where $ {S}_1 $ is the strongest driver from the previous step; for $ k=2 $, those with $ {X}_{t-\tau}^i\perp {Y}_t\mid {S}_2 $, where $ {S}_2 $ are the two strongest drivers from the previous step; and so on. In this work, conditional independence is tested using partial correlation (in general, though, the algorithm can be combined with any conditional independence test). The corresponding independence test is based on a standard significance level $ {pc}_{\alpha }=0.02 $ and uses a strength of association based on the absolute partial correlation value.
This iteration continues until $ k $ exceeds the cardinality of $ pa\left({Y}_t\right) $. The PC algorithm differs from PC $ {}_1 $ in that PC, for every $ k $, does not test conditional independence given only one cardinality-$ k $ subset of $ pa\left({Y}_t\right) $ but given all cardinality-$ k $ subsets of $ pa\left({Y}_t\right) $.
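The removal loop described above can be sketched in a few lines. The following is a minimal numpy-based illustration of a PC $ {}_1 $-style selection with a partial-correlation conditional independence test; the function names and the Fisher-z p-value approximation are our own simplifications, not the Tigramite implementation:

```python
import numpy as np
from math import atanh, erf, sqrt

def parcorr_pval(x, y, Z):
    """Partial correlation of x and y given the columns of Z,
    with a two-sided Fisher-z p-value."""
    if Z.shape[1] > 0:
        A = np.column_stack([Z, np.ones(len(x))])  # regressors plus intercept
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = float(np.corrcoef(x, y)[0, 1])
    z = atanh(max(-0.9999, min(0.9999, r))) * sqrt(len(x) - Z.shape[1] - 3)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return r, p

def pc1_select(X, y, pc_alpha=0.02, max_k=3):
    """Simplified PC1 loop: iteratively drop candidate drivers that are
    conditionally independent of y given the k strongest remaining drivers."""
    parents = list(range(X.shape[1]))
    strength = {j: 0.0 for j in parents}  # |partial correlation| per survivor
    k = 0
    while k <= min(max_k, len(parents)):
        ranked = sorted(parents, key=lambda c: -strength[c])
        survivors = []
        for j in parents:
            conds = [c for c in ranked if c != j][:k]  # k strongest other drivers
            r, p = parcorr_pval(X[:, j], y, X[:, conds])
            if p <= pc_alpha:  # dependence not rejected: keep the driver
                strength[j] = abs(r)
                survivors.append(j)
        parents = survivors
        k += 1
    return parents
```

On synthetic data where one column drives the target and the others are redundant or independent, the loop keeps the true driver while pruning most of the rest.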
The PCMCI algorithm (Runge et al., 2019b), after first running PC $ {}_1 $, reinitializes all links and then subjects each link to the momentary conditional independence (MCI) test $ {X}_{t-\tau}^i\perp {Y}_t\mid pa\left({Y}_t\right)\setminus \left\{{X}_{t-\tau}^i\right\},\, pa\left({X}_{t-\tau}^i\right) $, removing the link if independence is not rejected. Here, the condition on $ pa\left({X}_{t-\tau}^i\right) $ helps to remove false positives that tend to be inflated due to autocorrelation. Controlling false positives is important in a causal discovery setting but not necessarily in a causal feature selection/prediction setting as considered in this article. Within the study presented here, we employ both the PC $ {}_1 $ and the PCMCI algorithm to empirically analyze which of the two methods is preferable for causal feature selection. As with all causal discovery methods, PC $ {}_1 $ and PCMCI rely on certain assumptions. The essential assumption is that conditional independencies in the data are in one-to-one correspondence with $ d $-separations (Pearl, 1988) in the causal graph (Geiger et al., 1990; Verma and Pearl, 1990; Spirtes et al., 2000; Pearl, 2009). Moreover, both methods assume causal sufficiency (Spirtes et al., 2000), that is, the absence of unobserved variables that causally influence two observed variables. The latter assumption is not necessarily fulfilled in our context, even though we included a range of potential predictors. This means that some of the obtained causal features might still be spurious and may not work if the target distribution differs from the training distribution (out-of-distribution prediction).
When PC $ {}_1 $ and PCMCI are applied to a single multivariate time series, samples are drawn from this time series in a sliding-window fashion. The drawn set is internally passed to the statistical hypothesis tests of conditional independence. For this sliding-window approach to be valid, the causal relationships need to remain unchanged throughout the time series (causal stationarity assumption). When PC $ {}_1 $ and PCMCI are applied to multiple multivariate time series, and all time series of this ensemble can be assumed to share the same causal relationships within specific time ranges, we can combine the sample sets from all ensemble members into a single, larger dataset. This larger dataset, which includes data from multiple sources (e.g., from multiple storms), is then internally passed to the conditional independence tests, and the PC $ {}_1 $ and PCMCI algorithms proceed as usual. Consequently, although the input is an ensemble of multivariate time series, the output is still a single set of predictors. In addition to its practicality, our multidata approach benefits from an enlarged set of samples, increasing the power of the conditional independence tests. Hence, we expect the sets of predictors obtained by running multidata causal discovery on an ensemble of time series to be more reliable than those obtained by running causal discovery on any single member of this ensemble, provided the assumption of a common causal structure holds. An alternative approach would be to run PC $ {}_1 $ and PCMCI on each single-member time series and then appropriately aggregate the resulting sets of predictors across the members.
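The pooling step can be sketched as follows. This is a simplified illustration with a hypothetical array layout (each ensemble member as a time-by-variables array), not the Tigramite multidata interface: lagged samples from every member are concatenated before any conditional independence test is run.

```python
import numpy as np

def lagged_samples(member, target_col, tau_min, tau_max):
    """One member: stack all variables at lags tau_min..tau_max into a
    feature row for every admissible time t, with member[t, target_col]
    as the target. `member` has shape (time, variables)."""
    T, _ = member.shape
    rows, targets = [], []
    for t in range(tau_max, T):
        lags = [member[t - tau] for tau in range(tau_min, tau_max + 1)]
        rows.append(np.concatenate(lags))
        targets.append(member[t, target_col])
    return np.asarray(rows), np.asarray(targets)

def pool_members(members, target_col, tau_min, tau_max):
    """Multidata pooling: concatenate the lagged samples of every ensemble
    member (e.g., every storm) into one larger dataset, assuming a shared
    causal structure across members."""
    pairs = [lagged_samples(m, target_col, tau_min, tau_max) for m in members]
    X = np.vstack([X_m for X_m, _ in pairs])
    y = np.concatenate([y_m for _, y_m in pairs])
    return X, y
```

The pooled arrays are what the conditional independence tests would see internally; each member contributes its own admissible time steps, so members of different lengths are handled naturally.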
3. Application: Statistical Prediction of Tropical Cyclone Intensity
3.1. Motivation
The increasing frequency of intense TCs (Emanuel, 2005; Knutson et al., 2020) combined with growing coastal populations has escalated the vulnerability of tropical urban coasts. For context, more than half of Earth's population is projected to live in the tropics by 2050 (Edelman et al., 2014) and more than a billion people worldwide could be living in low-elevation coastal zones by 2060, particularly in South and East Asia (Neumann et al., 2015). Predicting storm intensity changes, including rapid intensification in TCs, remains a major challenge (DeMaria et al., 2014) because of unresolved complexities of storm dynamics in numerical models. Furthermore, numerical models suffer from a reduction in forecast skill with increasing lead time (Ganesh et al., 2018). An alternative to numerical forecasting is statistical forecasting, as statistical models can improve cyclogenesis and intensity forecasts (Kim et al., 2019; Chen et al., 2020). For instance, statistical models based on logistic regression, random forest, decision tree, and randomized trees (Su et al., 2020) outperformed the National Hurricane Center in predicting TC rapid intensification over the Atlantic and Eastern Pacific basins (Kaplan et al., 2010; Rozoff and Kossin, 2011).
A potential drawback of statistical models is that it is often difficult to choose appropriate predictors for reliable forecasts. To better predict TC intensity, models need to represent the physical mechanisms behind TC intensification more accurately; these include large-scale circulations, local conditions, and internal processes (Kaplan et al., 2010; Emanuel and Zhang, 2016). We argue that one way to make statistical models more robust is to apply causal discovery algorithms to eliminate causally spurious predictors. In this study, we apply the PC $ {}_1 $ and PCMCI methods to generate a single set of causally relevant predictors from multiple TC time series.
3.2. Data
The TC dataset is created using multiple environmental variables at different pressure levels known to be favorable for TC intensification (Sikora, 1976; Petty and Hobgood, 2000; Li et al., 2011) from the global high-resolution ECMWF ReAnalysis-5 (ERA5; Hersbach et al., 2020) with 25 km horizontal resolution and 3-hourly temporal resolution (see Section A of the Supplementary Material for the full list). Here, we selected a total of 260 TC cases spanning 2001 to 2020 in the Northwest Pacific basin (WPAC). TCs with a lifetime of more than 6 days up to landfall are selected to understand the effect of environmental parameters on TC intensity up to a 3-day time lag, so each case spans from genesis up to landfall based on each TC's best track. Rather than directly feeding the time series of predictor variables at each grid point around the storm, the values are averaged over horizontal areas defined with respect to the TC center. Each atmospheric predictor is post-processed into two sets of time series representing inner-core (TC center to a radius of 200 km) and outer-core characteristics (annulus from a radius of 200–800 km). The choice of averaging areas follows current practice in operational statistical intensity prediction schemes (DeMaria et al., 2005).
The distinction between outer-core and inner-core processes is justified because TC intensity is affected by environmental conditions in the storm's neighborhood and internal processes within the storm (Sitkowski and Barnes, 2009; Hendricks et al., 2019). From an ML perspective, this preliminary dimensionality reduction removes features with high spatial correlations, reduces the complexity of the statistical models, and possibly improves model generalizability by removing some of the predictors' spatial heterogeneity in different storms.
Once this preliminary dimensionality reduction is done, our goal is to eliminate spurious features, here defined based on meteorological variable, vertical (pressure) level, time lag, and horizontal averaging sector (inner or outer core). We describe each TC using a total of $ {N}_X=234 $ time series of horizontally averaged 3D variables at given vertical levels and 2D variables (see Supplementary Table 1 for details). With regard to the time lags, we explore the time steps between 24 hr before the target (corresponding to $ {\tau}_{min}=8 $) and 72 hr before the target (corresponding to $ {\tau}_{max}=24 $). This results in a total of 3,978 candidate features (234 potential predictors $ \times $ 17 time steps) for the causal algorithms, which eliminate the spurious links between the features and the targets. We randomly split the data by TC to avoid spatiotemporal correlation: out of the selected 260 TCs, we randomly split 205 cases from 2001 to 2020 into a training set (150 cases) and a validation set (55 cases) while keeping 55 cases from recent years (2017–2020) in the test set, without any overlap. The regression task is to forecast three variables with a 1-day lead time: (1) maximum wind speed at 10 m (max. 10 m wind, in m/s), (2) minimum sea-level pressure (MSLP, in hPa), and (3) horizontally integrated total precipitation (Tot. Intg. Precip., in $ {km}^2 $) accumulated over 3-hourly intervals. Maximum sustained wind speed at 10 m (averaged over 1 min, 3 min, or 10 min depending on the Regional Specialized Meteorological Centre) is the standard intensity measure currently used operationally. We include MSLP as it correlates better with TC damage (Atkinson and Holliday, 1977; Kaplan et al., 2010). Additionally, MSLP is easier to estimate as it is an integrated quantity and only requires a couple of measurements near the storm center.
Finally, we included total integrated precipitation as a potential target because most fatalities and damage from TCs are caused by heavy precipitation and storm surges (Rappaport, 2014).
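The storm-level split described above can be sketched as follows: whole storms, never individual time steps, are assigned to exactly one set. The TC identifiers and the way the 55 held-out storms are passed in are hypothetical simplifications for illustration; in the study, the test storms are the recent 2017–2020 cases.

```python
import random

def split_by_storm(all_ids, test_ids, n_train=150, seed=0):
    """Assign each storm ID to exactly one of train/val/test, so that no TC
    contributes samples to more than one set."""
    test = list(test_ids)
    held_out = set(test)
    rest = [i for i in all_ids if i not in held_out]
    rng = random.Random(seed)       # fixed seed for a reproducible split
    rng.shuffle(rest)
    return rest[:n_train], rest[n_train:], test
```

With 260 storms and 55 reserved for testing, this yields the 150/55/55 train/validation/test composition used in the study.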
3.3. Causal machine learning
In this section, we describe the feature selection methodology in the context of TC intensity prediction. Our causal feature selection framework is shown in Figure 1. Once the four-dimensional fields have been reduced to time series, we align the time series in the training set based on the minimum pressure value recorded during each TC's lifetime, which is a smoother measure of TC intensity than maximum surface wind speed (Chavas et al., 2017). Temporally aligning the multivariate time series of different ensemble members is key, as the resulting ensemble is more likely to satisfy the common causal structure assumption, improving the prediction skill obtained with causal feature selection. After aligning the time series, we feed the training set as input to the PC $ {}_1 $ and PCMCI algorithms (both implemented in Tigramite) to extract the most significant causal features. Here, an input feature is defined as an environmental or derived variable (see Supplementary Table 1) at a given pressure level, spatially averaged over either the inner or outer core, at a given 3-hourly time step. We stress that PC $ {}_1 $ and PCMCI are only applied to the training data. The implementation of both PC $ {}_1 $ and PCMCI contains several hyperparameters, including the minimum and maximum time lags for the analysis, fixed to 1 day (8 time steps) and 3 days (24 time steps), respectively, for our prediction task. Further tunable hyperparameters are the significance level for conditional independence testing ($ {pc}_{\alpha } $) and a significance threshold for the p-value matrix (alpha-level, only used for PCMCI), which control the selected number of features.
Once PC $ {}_1 $ and PCMCI have selected the most significant causal drivers of the targets from the ensemble of time series, these drivers are used as inputs to the prediction model. We logarithmically vary the values of $ {pc}_{\alpha } $ and alpha-level, which in effect controls the number of selected features, as more stringent values result in the selection of fewer, more significant features. From this set of experiments, we determine the best hyperparameters for each target of interest by maximizing the validation performance of the trained regression models. We use multiple linear regression (MLR) and random forest (RF) regression models to predict the targets from the causally selected features. The MLR models for the causal-MLR experiments were prepared using the scikit-learn (Pedregosa et al., 2011) implementation of linear regression with its default parameters. Each MLR model was trained to predict one of the three unscaled target variables using the selected, standard-scaled features. We also included RF regression models using the same causal feature selection set-up (Figure 1) to explore the impact of causal feature selection for more complex nonlinear regression methods. We used the RF regressor from the scikit-learn library (Pedregosa et al., 2011) to prepare the causal-RF and optimized its hyperparameters with a randomized search.
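The regression set-up (standard-scaled features, unscaled target) can be illustrated with a small numpy sketch. The study itself uses scikit-learn's LinearRegression; this stand-in solves the same least-squares problem and is only meant to make the scaling convention explicit:

```python
import numpy as np

class StandardizedMLR:
    """Multiple linear regression on standard-scaled features and an
    unscaled target (numpy stand-in for the scikit-learn pipeline)."""

    def fit(self, X, y):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Xs = (X - self.mean_) / self.std_
        A = np.column_stack([Xs, np.ones(len(y))])  # append intercept column
        self.coef_ = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        Xs = (X - self.mean_) / self.std_  # reuse the training-set statistics
        A = np.column_stack([Xs, np.ones(len(X))])
        return A @ self.coef_
```

Because the scaler statistics are stored at fit time and reused at prediction time, validation and test storms are transformed with training-set statistics only, consistent with applying feature selection to the training data alone.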
3.4. Noncausal machine-learning baselines
This study is motivated by the working hypothesis that regression models using causally selected features outperform noncausal baselines in terms of generalization. Here, noncausal baselines subsume both the case of no feature selection and the case of feature selection based on noncausal criteria. We compare our causal feature selection to noncausal feature selection methods such as lagged correlation and random selection, as well as an explainable artificial intelligence (XAI) feature selection method using RF regression (more details are provided in Section C of the Supplementary Material). To test our causal approach's ability to effectively use time lags, we also train a long short-term memory (LSTM) network using all time lags between $ {\tau}_{min} $ and $ {\tau}_{max} $ without feature selection. We implement the LSTM using the PyTorch (Paszke et al., 2019) library and conduct a hyperparameter search with the Optuna (Akiba et al., 2019) framework. A more detailed description of the architecture is provided in Section C of the Supplementary Material.
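As an illustration of the simplest baseline, a lagged-correlation selector can be sketched as follows. This is our own minimal version (rank every feature-lag column by absolute Pearson correlation with the target and keep the top few); the study's exact baseline is described in the Supplementary Material.

```python
import numpy as np

def lagged_corr_select(X, y, n_features):
    """Noncausal baseline: rank each feature-lag column of X by its absolute
    Pearson correlation with the target y and keep the top n_features."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return list(np.argsort(-np.abs(corr))[:n_features])
```

Unlike a conditional independence test, this ranking treats every column in isolation, so a cluster of mutually redundant columns can crowd out genuinely complementary predictors, which is one plausible reason for the poor generalization of this baseline reported below.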
4. Results
4.1. Performance of causal machine learning
Our first objective is to find the set of causal features best linked to TC intensification at a lead time of 1 day. To measure the suitability of a set of causal features, we evaluate how MLR and RF models trained with the causally selected features perform when predicting TCs unseen during training. We evaluate prediction skill holistically (see Section D of the Supplementary Material), but focus on the coefficient of determination ($ {R}^2 $) in the main text, with $ {R}^2=1 $ corresponding to a perfect prediction and $ {R}^2=0 $ to an error of one standard deviation. In Figure 2, we show the performance of causal-ML models predicting the maximum winds 24 hr in advance using the M-PC $ {}_1 $ method. A less stringent significance threshold results in a larger set of features retained during training, which has a clear negative effect on validation performance. We found that feature selection using M-PCMCI is comparable to M-PC $ {}_1 $ when the minimum time lag is 6 hr (shown in Section B of the Supplementary Material), but the performance of M-PCMCI drastically deteriorates when we increase the minimum time lag to 1 day. We therefore only show the causal ML results based on M-PC $ {}_1 $ here. For comparison, similar experiments with reduced lead times, where minimum and maximum time lags are set to 6 hr and 2 days, respectively, are shown in Section B of the Supplementary Material. PCMCI's main advantage is better control of false positives in the presence of strong autocorrelation (Runge et al., 2019b), which matters more for an actual causal discovery setting than for the causal prediction setting considered in this article.
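The skill metric above can be made concrete with a short sketch; in particular, $ {R}^2=0 $ corresponds to a mean-squared error equal to the target variance, that is, a typical error of one standard deviation:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect prediction; 0 means the
    squared error equals the variance of the target (an error of one standard
    deviation); negative values are worse than predicting the mean."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot
```
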
Causal-MLR scores are better than those of noncausal MLRs (Figure 2a), which use all inputs without feature selection. The noncausal baselines clearly overfit the training set. The causal MLR is also compared with a recurrent neural network using an LSTM layer, and causal MLRs outperform the best LSTM model for all targets, which is remarkable given their simplicity. When comparing the RF models, we find that causal-RF scores are better than noncausal-RF scores for the validation set, whereas test set scores are comparable for wind speed predictions. In general, the $ {R}^2 $ values are highest for the predictions of MSLP, followed by wind and total integrated precipitation (Figure 2 and Supplementary Figures 1 and 2). The optimal set of causal drivers that performs well on the training and validation sets (without leading to overfitting) is sparse, containing only 31 features in the causal-MLR case for predicting maximum wind (Table 1). This result suggests that many of the features are spuriously linked to TC intensity and can be removed without sacrificing the predictive skill of simple MLR models compared to a noncausal-RF baseline. Similar results are obtained for the prediction of other targets, as shown in Supplementary Figures 1 and 2.
Note: Causal-MLR (bold) gives the best performance with the fewest features.
4.2. Comparison with feature selection baselines
Our second objective is to compare the performance of causal MLRs to MLRs with noncausal feature selection baselines (described in Section C of the Supplementary Material). For the maximum wind predictions (Figure 3), PC $ {}_1 $ consistently outperforms the two simpler feature selection baselines (random and lagged correlation), especially on the validation and test sets (Figure 3b,c). Lagged correlation, in particular, selects sets of predictors that perform very poorly in comparison, especially during validation. The ability of XAI-based feature selection to capture nonlinearities seems to improve predictor selection, resulting in $ {R}^2 $ values almost comparable to those of the PC $ {}_1 $ causal feature selection method. Nevertheless, the causal PC $ {}_1 $ method retains an advantage for very sparse models (fewer than 50 input features), suggesting that the initial selection of causally relevant predictors allows these sparse models to beat the corresponding noncausal sparse models. PC $ {}_1 $ also performs better than the other methods for the two other targets, as shown in Section C of the Supplementary Material (Supplementary Figures 3–6). We note that lagged correlation performance is comparable to causal-MLR and XAI-based feature selection in predicting total integrated precipitation. This motivates adapting our conditional independence tests to non-normal distributions, which we leave for future work. The performance comparison based on $ {R}^2 $ of all best ML models used in the study, along with their numbers of input features, is listed in Table 1.
4.3. Optimal causal features
Here, we expand upon our results from the previous section to understand why causal-MLR models outperform models that use other feature selection methods. For this purpose, we rely on the frequency of predictor selection (across the models) by the best-performing causal-MLR and lagged-correlation MLR models. We find that the two methods choose different predictors for the maximum wind predictions, while both identify inner-core relative humidity as critical for wind prediction. However, divergence is a major predictor in the causal-MLR (Figure 4a), despite not being in the 10 most frequently selected predictors for the lagged-correlation models (Figure 4d). The vertical distribution (Figure 4b,e) and the time lag information (Figure 4c,f) of the most frequently chosen features reveal several differences between the causal models and the lagged-correlation models. Unlike the lagged-correlation method, the causal method selects features at the specific vertical levels and time lags that are most informative for the prediction (Figure 4b), rather than placing importance over a wide range of vertical levels (Figure 4e) and lags. The PC $ {}_1 $ algorithm iteratively removes variables from the parent set $ pa\left({Y}_t\right) $ that are irrelevant or redundant for the prediction of the target $ {Y}_t $ via conditional independence tests (here, based on partial correlation). PC $ {}_1 $ then ranks features based on the test statistics, which gives a good measure of predictor relevance. Once a feature $ {X}_{t-\tau } $ is in the parent set $ pa\left({Y}_t\right) $, its spatiotemporal neighbors tend to be removed because, conditionally on $ {X}_{t-\tau } $, they are independent of $ {Y}_t $. This confirms the interpretation that the selected time lags and vertical levels are "most predictive" of $ {Y}_t $, and that the neighbors of a selected feature are eliminated as redundant, owing to the high spatiotemporal correlations in our dataset.
Next, from a scientific viewpoint, the causal models clearly emphasize low-level convergence (div) in the inner core, middle- and upper-tropospheric relative humidity (RH, rhum) in the inner core, and upper-level divergence (outdiv) in the outer core as the most important predictors, whereas the lagged-correlation models rely on middle- and upper-tropospheric vertical motion (vvel) for the prediction task. Finally, in the time-lag plots (Figure 4c,f), the divergence links in the causal models are chosen at time lags of more than 2 days (−60 hr $ < $ t $ < $ −50 hr), while lagged-correlation models focus on features at the shortest lead times (t $ > $ −48 hr), as these are more correlated with decreasing time lag.
Causal-MLR models rely on low-level convergence and upper-level divergence at longer time lags. In contrast, the lagged correlation MLR models mostly rely on mid- to upper-tropospheric vertical motion at shorter lead times. One interpretation is that the mid-tropospheric vertical motion acts as a confounder, which the PC $ {}_1 $ algorithm removes. In this case, the difference in generalization skill may be attributed to the lagged correlation MLRs making predictions based on causally spurious links. The causal relationship involved here can be understood in mass-adjustment terms: mass conservation requires low-level convergence and upper-level divergence to be balanced by upward mass transport. This upward motion can invigorate convection and aid TC intensification. Hence, vertical motion should be considered a consequence of divergence rather than an independent process that drives TC intensification by itself. The removal of mid-level vertical motion from the PC $ {}_1 $ feature set therefore shows that causal discovery algorithms can successfully remove causally spurious links.
5. Conclusion
This article described a causal feature selection framework to predict and understand complex geophysical events that can be considered multiple realizations of the same process with small perturbations. We applied this framework to multiple TC time series to identify common causal links and used them as predictors in MLR and RF regression models. Our results show that causal feature selection is superior to traditional feature selection methods for finding sets of predictors that help regression models generalize to unseen TC cases, especially for very sparse models (Figure 3). Of the two causal methods, we find that the PC $ {}_1 $ algorithm is more appropriate for feature selection, as it only keeps the most informative features, effectively removes confounding features (e.g., mid-tropospheric vertical motion in Figure 4), and is less sensitive to the forecast lead time (Supplementary Figures 3–5) than PCMCI. Temporally aligning the multiple time series based on a common reference point before causal feature selection tangibly improves model prediction skill. Conversely, the retention of spurious links in the lag correlation models negatively affected generalizability. From these observations, we conclude that causal feature selection holds potential in our continued effort to improve statistical TC intensity models. Future efforts will involve (a) assessing whether current operational intensity prediction baselines can be improved by causality-based predictor selection; (b) expanding the study to all ocean basins; and (c) discovering new potential predictors that may improve operational TC intensity predictions.
While not studied in this article, multidata causal discovery also opens the possibility of analyzing systems whose causal structure changes in time: If one can align the individual member time series on a common time axis and can assume that their causal structures, although changing in time, are the same across members, then a dataset for independence testing can still be created by taking one sample per member time series. Repeating this procedure for every time step would yield one set of predictors per variable and per time step. Similarly, one could obtain one set of predictors per variable and per time window in a sequence of time windows (useful for slowly varying causal structures). Finally, we note that the multidata approach does not rely on any particular causal discovery algorithm. Therefore, while not shown here, it can also be employed with the PCMCI $ {}^{+} $ algorithm (Runge, Reference Runge2020), a variant of PCMCI that allows for contemporaneous causal influences, and with the latent-PCMCI (LPCMCI) algorithm (Gerhardus and Runge, Reference Gerhardus and Runge2020), which additionally allows for latent confounders (both available within the open-source Python package Tigramite). Lastly, one could further optimize predictions by selecting the subset of causal predictors with the highest (validation-set) skill, as discussed from an information-theoretic perspective in Runge et al. (Reference Runge, Donner and Kurths2015a). In our context, however, iterating through all feature subsets is computationally prohibitive.
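The "one sample per member time series per time step" construction can be sketched as follows. This is a minimal illustration under our own assumed layout, aligned member arrays of shape (T, n_vars) with the target in column 0; it is not the Tigramite multidata interface, and the function and variable names are ours.

```python
import numpy as np

def samples_at(members, t, tau):
    """One (driver, target) sample per ensemble member for time step t:
    Y_t from each member, paired with candidate drivers at lag tau."""
    y = np.array([m[t, 0] for m in members])         # target Y_t across members
    X = np.array([m[t - tau, 1:] for m in members])  # candidate drivers at t - tau
    return X, y

# Two toy "members" aligned on a common time axis (6 steps, 2 variables each)
members = [np.arange(12, dtype=float).reshape(6, 2),
           np.arange(12, dtype=float).reshape(6, 2) + 100.0]
X, y = samples_at(members, t=5, tau=2)
```

Repeating this over every time step t (or over sliding time windows) yields one independence-testing dataset, and hence one candidate parent set per variable, per time step, as described above.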
Acknowledgments
We thank the DCSR at UNIL for providing the computational resources and technical support.
Author contribution
Conceptualization: S.G.S., T.B., F.I.T., M.G.; Data curation: S.G.S., F.I.T., T.B.; Data visualization: S.G.S., F.I.T., T.B.; Methodology: J.R., A.G., T.B., S.G.S.; Writing original draft: S.G.S. All authors contributed to writing and review. All authors approved the revised manuscript.
Competing interest
The authors declare that no competing interests exist.
Data availability statement
The code and tutorials for multidata causal discovery are freely available in the Tigramite GitHub repository and have been archived in Zenodo at https://doi.org/10.5281/zenodo.7747255. Sample code for the application is available at the Causal-ML GitHub repository and has been archived in Zenodo at https://doi.org/10.5281/zenodo.7907217. The WPAC TC data are from the IBTrACS data archive. ERA5 datasets were downloaded from the Copernicus website (multiple pressure levels as well as single pressure levels).
Ethics statement
The research meets all ethical guidelines.
Funding statement
This research was supported by the canton of Vaud in Switzerland. J.R. has received funding from the European Research Council (ERC) Starting Grant CausalEarth under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 948112).
Provenance statement
This article is part of the Climate Informatics 2023 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer review process.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/eds.2023.21.