
Pseudo-model-free hedging for variable annuities via deep reinforcement learning

Published online by Cambridge University Press:  14 March 2023

Wing Fung Chong
Affiliation:
Maxwell Institute for Mathematical Sciences and Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh EH14 4AS, UK
Haoen Cui
Affiliation:
School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA
Yuxuan Li*
Affiliation:
Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
*Corresponding author. E-mail: yuxuanl9@illinois.edu

Abstract

This paper proposes a two-phase deep reinforcement learning approach for hedging variable annuity contracts with both GMMB and GMDB riders, which can address model miscalibration in Black-Scholes financial and constant force of mortality actuarial market environments. In the training phase, an infant reinforcement learning agent interacts with a pre-designed training environment, collects sequential anchor-hedging reward signals, and gradually learns how to hedge the contracts. As expected, after a sufficient number of training steps, the trained reinforcement learning agent hedges, in the training environment, as well as the correct Delta, while outperforming misspecified Deltas. In the online learning phase, the trained reinforcement learning agent interacts with the market environment in real time, collects single terminal reward signals, and self-revises its hedging strategy. The hedging performance of the further trained reinforcement learning agent is demonstrated via an illustrative example on a rolling basis to reveal the self-revision capability of the hedging strategy by online learning.

Type
Original Research Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries

1. Introduction

Variable annuities are long-term life insurance products, in which policyholders participate in financial investments for profit sharing with insurers. Various guarantees are embedded in these contracts, such as guaranteed minimum maturity benefit (GMMB), guaranteed minimum death benefit (GMDB), guaranteed minimum accumulation benefit (GMAB), guaranteed minimum income benefit (GMIB), and guaranteed minimum withdrawal benefit (GMWB). According to the Insurance Information Institute in 2020, sales of variable annuity contracts in the United States averaged $100.7$ billion annually from 2016 to 2020.

Due to their popularity in the market and their dual-risk bearing nature, valuation and risk management of variable annuities have been substantially studied in the literature. By the risk-neutral option pricing approach, to name a few, Milevsky & Posner (2001) studied the valuation of the GMDB rider; valuation and hedging of the GMMB rider under the Black-Scholes (BS) financial market model were covered in Hardy (2003); the GMWB rider was extensively investigated by Milevsky & Salisbury (2006), Dai et al. (2008), and Chen et al. (2008); valuation and hedging of the GMMB rider were studied in Cui et al. (2017) under the Heston financial market model; valuation of the GMMB rider, together with the feature that a contract can be surrendered before its maturity, was examined by Jeon & Kwak (2018), in which optimal surrender strategies were also provided. For a comprehensive review of this approach, see Feng (2018).

Valuation and risk management of variable annuities have recently been advanced via various approaches as well. Trottier et al. (2018) studied the hedging of variable annuities in the presence of basis risk based on a local optimisation method. Chong (2019) revisited the pricing and hedging problem of equity-linked life insurance contracts utilising the so-called principle of equivalent forward preferences. Feng & Yi (2019) compared the dynamic hedging approach to the stochastic reserving approach for the risk management of variable annuities. Moenig (2021a) investigated the valuation and hedging problem of a portfolio of variable annuities via a dynamic programming method. Moenig (2021b) explored the impact of market incompleteness on the policyholder's behaviour. Wang & Zou (2021) solved for the optimal fee structure for the GMDB and GMMB riders. Dang et al. (2020, 2022) proposed and analysed efficient simulation methods for measuring the risk of variable annuities.

Recently, state-of-the-art machine learning methods have been deployed to revisit the valuation and hedging problems of variable annuities at a portfolio level. Gan (2013) proposed a three-step technique, by (i) selecting representative contracts with a clustering method, (ii) pricing these contracts with Monte Carlo (MC) simulation, and (iii) predicting the value of the whole portfolio based on the values of the representative contracts with a kriging method. To further boost the efficiency and effectiveness of selecting and pricing the representative contracts, as well as of valuating the whole portfolio, various methods for each of these three steps have been proposed. For instance, Gan & Lin (2015) extended the ordinary kriging method to the universal kriging method; Hejazi & Jackson (2016) used a neural network as the predictive model to valuate the whole portfolio; Gan & Valdez (2018) implemented the generalised beta of the second kind method instead of the kriging method to capture the non-Gaussian behaviour of the market price of variable annuities. See also Gan (2018), Gan & Valdez (2020), Gweon et al. (2020), Liu & Tan (2020), Lin & Yang (2020), Feng et al. (2020), and Quan et al. (2021) for recent developments in this three-step technique. A similar idea has also been applied to the calculation of Greeks and risk measures of a portfolio of variable annuities; see Gan & Lin (2017), Gan & Valdez (2017), and Xu et al. (2018). All of the above literature applying machine learning methods involves supervised learning, which requires a pre-labelled dataset (in this case, the set of fair prices of the representative contracts) to train a predictive model.

Other than valuating and hedging variable annuities, supervised learning methods have also been applied in different actuarial contexts. Wüthrich (2018) used a neural network for the chain-ladder factors in the chain-ladder claim reserving model to include heterogeneous individual claim features. Gao & Wüthrich (2019) applied a convolutional neural network to classify drivers using their telematics data. Cheridito et al. (2020) estimated the risk measures of a portfolio of assets and liabilities with a feedforward neural network. Richman & Wüthrich (2021) and Perla et al. (2021) studied the mortality rate forecasting problem: Richman & Wüthrich (2021) extended the traditional Lee-Carter model to multiple populations using a neural network, while Perla et al. (2021) applied deep learning techniques directly to time-series data of mortality rates. Hu et al. (2022) modified the loss function in tree-based models to improve the predictive performance when applied to imbalanced datasets, which are common in insurance practice.

Meanwhile, a flourishing sub-field of machine learning, called reinforcement learning (RL), has been advancing rapidly and has proved its power in various tasks; see Silver et al. (2017) and the references therein. Contrary to supervised learning, RL does not require a pre-labelled dataset for training. Instead, in RL, an agent interacts with an environment by sequentially observing states, taking, as well as revising, actions, and collecting rewards. Without possessing any prior knowledge of the environment, the agent needs to explore the environment while exploiting the collected reward signals for learning. For a representative monograph on RL, see Sutton & Barto (2018); for its broad applications in economics, game theory, operations research, and finance, see the recent survey paper by Charpentier et al. (2021).

The mechanism of RL resembles how a hedging agent hedges any contingent claim dynamically. Indeed, the hedging agent cannot know any specifics of the market environment, but can only observe states from the environment, take a hedging strategy, and learn from reward signals to progressively improve the hedging strategy. However, in the context of hedging, if an insurer builds a hedging agent based on a certain RL method, called the RL agent hereafter, and allows this infant RL agent to interact with and learn from the market environment right away, the insurer could bear an enormous financial loss while the infant RL agent is still exploring the environment before it can effectively exploit the reward signals. Moreover, since the insurer does not know any specifics of the market environment either, they cannot supply any information derived from theoretical models to the infant RL agent, and thus the agent can only obtain reward signals via the realised terminal profit and loss, based on the realised net liability and hedging portfolio value; such signals are unlikely to be effective for an infant RL agent learning from the market environment.

To resolve these two issues, we propose a two-phase (deep) RL approach, which is composed of a training phase and an online learning phase. In the training phase, based on their best knowledge of the market, the insurer constructs a training environment. An infant RL agent is then designated to interact with and learn from this training environment for a period of time. Compared to putting the infant RL agent in the market environment right away, the infant RL agent can be supplied with more information derived from the constructed training environment, such as the net liabilities before any terminal times. In this paper, we propose that the RL agent collects anchor-hedging reward signals during the training phase. After the RL agent has become experienced with the training environment, in the online learning phase, the insurer finally deploys the trained RL agent in the market environment. Again, since no theoretical model for the market environment is available to the insurer, the trained RL agent can only collect single terminal reward signals in this phase. In this paper, an illustrative example is provided to demonstrate the hedging performance of this approach.

All RL methods can be classified into either MC or temporal-difference (TD) learning. As a TD method shall be employed in this paper, in both the training and online learning phases, the following RL literature review focuses on the latter. Sutton (1984, 1988) first introduced the TD method for the prediction of value functions. Building upon these works, Watkins (1989) and Watkins & Dayan (1992) proposed the well-known Q-learning for finite state and action spaces. Since then, Q-learning has been improved substantially, by Hasselt (2010) with Double Q-learning, and by Mnih et al. (2013, 2015) with deep Q-learning, which allows an infinite state space. Any Q-learning approach, or more generally tabular solution methods and value function approximation methods, is only applicable to a finite action space. However, in the context of hedging, the action space is infinite. Instead of discretising the action space, proximal policy optimisation (PPO) by Schulman et al. (2017), which is a policy gradient method, shall be applied in this paper; section 3.4 shall provide a self-contained review of it.

To the best of our knowledge, this paper is the first work to implement RL algorithms with online learning to hedge contingent claims, and variable annuities in particular. Contrary to Xu (2020) and Carbonneau (2021), both of which adapted the state-of-the-art deep hedging (DH) approach of Bühler et al. (2019), this paper is in line with the recent works by Kolm & Ritter (2019) and Cao et al. (2021), while extending them with actuarial components. We shall outline the differences between the RL and DH approaches throughout sections 3 and 4, as well as Appendices A and B. Kolm & Ritter (2019) discretised the action space and implemented RL algorithms for finitely many possible actions; however, as mentioned above, this paper does not discretise the action space but adapts the recently advanced policy gradient method, namely the PPO. Compared with Cao et al. (2021), in addition to the actuarial elements, this paper puts forward online learning to self-revise the hedging strategy.

In the illustrative example, we assume that the market environment is given by the BS financial and constant force of mortality (CFM) actuarial markets, and the focus is on contracts with both GMMB and GMDB riders. Furthermore, we assume that the model of the market environment presumed by the insurer, which shall be supplied as the training environment, is also the BS and the CFM, but with a different set of parameters. That is, while the insurer constructs correct dynamic models of the market environment for the training environment, the parameters in the model of the market environment are not the same as those in the market environment. Section 2.4 shall set the stage of this illustrative example and shall show that, if the insurer forwardly implements, in the market environment, the incorrect Delta hedging strategy based on their presumed model of the market environment, then its hedging performance for the variable annuities is worse than that of the correct Delta hedging strategy based on the market environment. In sections 4 and 6, this illustrative example shall be revisited using the two-phase RL approach. As we shall see in section 6, the hedging performance of the RL agent is even worse than that of the incorrect Delta at the very beginning of hedging in real time. However, a delicate analysis shows that, with a fair number of future trajectories (which are different from simulated scenarios, with more details in section 6), the hedging performance of the RL agent becomes comparable with that of the correct Delta within a reasonable amount of time. Therefore, the illustrative example addresses the model miscalibration issue, which is common in practice, in hedging variable annuity contracts with GMMB and GMDB riders in BS financial and CFM actuarial market environments.

This paper is organised as follows. Section 2 formulates the continuous hedging problem for variable annuities, reformulates it to the discrete and Markov setting, and motivates as well as outlines the two-phase RL approach. Section 3 discusses the RL approach in hedging variable annuities and provides a self-contained review of RL, particularly the PPO, which is a TD policy gradient method, while section 5 presents the implementation details of the online learning phase. Sections 4 and 6 revisit the illustrative example in the training and online learning phases, respectively. Section 7 collates the assumptions of utilising the two-phase RL approach for hedging contingent claims, as well as their implications in practice. This paper finally concludes and comments on future directions in section 8.

2. Problem Formulation and Motivation

2.1. Classical hedging problem and model-based approach

We first review the classical hedging problem for variable annuities and its model-based solution to introduce some notations and to motivate the RL approach.

2.1.1. Actuarial and financial market models

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a rich enough complete probability space. Consider the current time $t=0$ and fix $T>0$ as a deterministic time in the future. Throughout this paper, all time units are in year.

There are one risk-free asset and one risky asset in the financial market. Let $B_t$ and $S_t$ , for $t\in\left[0,T\right]$ , be the time-t values of the risk-free asset and the risky asset, respectively. Let $\mathbb{G}^{\left(1\right)}=\left\{\mathcal{G}^{\left(1\right)}_t\right\}_{t\in\left[0,T\right]}$ be the filtration which contains all financial market information; in particular, both processes $B=\left\{B_t\right\}_{t\in\left[0,T\right]}$ and $S=\left\{S_t\right\}_{t\in\left[0,T\right]}$ are $\mathbb{G}^{\left(1\right)}$ -adapted.

There are N policyholders in the actuarial market. For each policyholder $i=1,2,\dots,N$ , denote $T_{x_i}^{\left(i\right)}$ as the random future lifetime of the policyholder, who is of age $x_i$ at the current time 0. Define, for each $i=1,2,\dots,N$ , and for any $t\geq 0$ , $J_t^{\left(i\right)}=\unicode{x1d7d9}_{\left\{T_{x_i}^{\left(i\right)}>t\right\}}$ as the corresponding time-t jump value generated by the random future lifetime of the i-th policyholder; that is, if the i-th policyholder survives at some time $t\in\left[0,T\right]$ , $J_t^{\left(i\right)}=1$ ; otherwise, $J_t^{\left(i\right)}=0$ . Let $\mathbb{G}^{\left(2\right)}=\left\{\mathcal{G}^{\left(2\right)}_t\right\}_{t\in\left[0,T\right]}$ be the filtration which contains all actuarial market information; in particular, all single-jump processes $J^{\left(i\right)}=\left\{J_t^{\left(i\right)}\right\}_{t\in\left[0,T\right]}$ , for $i=1,2,\dots,N$ , are $\mathbb{G}^{\left(2\right)}$ -adapted.

Let $\mathbb{F}=\left\{\mathcal{F}_t\right\}_{t\in\left[0,T\right]}$ be the filtration which contains all actuarial and financial market information; that is, $\mathbb{F}=\mathbb{G}^{\left(1\right)}\vee\mathbb{G}^{\left(2\right)}$ . Therefore, the filtered probability space is given by $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$ .

2.1.2. Variable annuities with guaranteed minimum maturity benefit and guaranteed minimum death benefit riders

At the current time 0, an insurer writes a variable annuity contract to each of these N policyholders. Each contract is embedded with both GMMB and GMDB riders. Assume that all these N contracts expire at the same fixed time T. In the following, fix a generic policyholder $i=1,2,\dots,N$ .

At the current time 0, the policyholder deposits $F_0^{\left(i\right)}$ into their segregated account to purchase $\rho^{\left(i\right)}>0$ shares of the risky asset; that is, $F_0^{\left(i\right)}=\rho^{\left(i\right)}S_0$ . Assume that the policyholder does not revise the number of shares $\rho^{\left(i\right)}$ throughout the effective time of the contract.

For any $t\in\left[0,T_{x_i}^{\left(i\right)}\wedge T\right]$ , the time-t segregated account value of the policyholder is given by $F_t^{\left(i\right)}=\rho^{\left(i\right)}S_te^{-m^{\left(i\right)}t}$ , where $m^{\left(i\right)}\in\!\left(0,1\right)$ is the continuously compounded annualised rate at which the asset-value-based fees are deducted from the segregated account by the insurer. For any $t\in\!\left(T_{x_i}^{\left(i\right)}\wedge T,T\right]$ , the time-t segregated account value $F_t^{\left(i\right)}$ must be 0; indeed, if the policyholder dies before the maturity, i.e. $T_{x_i}^{\left(i\right)}<T$ , then, due to the GMDB rider of a minimum guarantee $G_D^{\left(i\right)}>0$ , the beneficiary inherits $\max\left\{F^{\left(i\right)}_{T_{x_i}^{\left(i\right)}},G_D^{\left(i\right)}\right\}$ , which can be decomposed into $F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}+\left(G_D^{\left(i\right)}-F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}\right)_+$ , at the policyholder’s death time $T_{x_i}^{\left(i\right)}$ right away. Due to the GMMB rider of a minimum guarantee $G_M^{\left(i\right)}>0$ , if the policyholder survives beyond the maturity, i.e. $T_{x_i}^{\left(i\right)}>T$ , the policyholder acquires $\max\left\{F_T^{\left(i\right)},G_M^{\left(i\right)}\right\}$ at the maturity, which can be decomposed into $F_T^{\left(i\right)}+\left(G_M^{\left(i\right)}-F_T^{\left(i\right)}\right)_+$ .
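The account value and the payout decompositions above can be computed directly. The following minimal Python sketch illustrates them; the function names and the sample inputs are hypothetical and for illustration only.

```python
import numpy as np

def segregated_account_value(S_t: float, t: float, rho: float, m: float) -> float:
    """Time-t segregated account value F_t = rho * S_t * exp(-m * t)."""
    return rho * S_t * np.exp(-m * t)

def gmmb_payout(F_T: float, G_M: float) -> tuple:
    """Maturity payout max(F_T, G_M) = F_T + (G_M - F_T)_+; returns (payout, GMMB claim)."""
    claim = max(G_M - F_T, 0.0)
    return F_T + claim, claim

def gmdb_payout(F_death: float, G_D: float) -> tuple:
    """Death payout max(F_death, G_D) = F_death + (G_D - F_death)_+; returns (payout, GMDB claim)."""
    claim = max(G_D - F_death, 0.0)
    return F_death + claim, claim

# Hypothetical example: rho = 1 share, fee rate m = 2% p.a., S_T = 95 at maturity T = 10.
F_T = segregated_account_value(S_t=95.0, t=10.0, rho=1.0, m=0.02)
payout, gmmb_claim = gmmb_payout(F_T, G_M=100.0)
```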

2.1.3. Net liability of insurer

The liability of the insurer thus has two parts. The liability from the GMMB rider at the maturity for the i-th policyholder, where $i=1,2,\dots,N$ , is given by $\left(G_M^{\left(i\right)}-F_T^{\left(i\right)}\right)_+$ if the i-th policyholder survives beyond the maturity, and is 0 otherwise. The liability from the GMDB rider at the death time $T_{x_i}^{\left(i\right)}$ for the i-th policyholder, where $i=1,2,\dots,N$ , is given by $\left(G_D^{\left(i\right)}-F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}\right)_+$ if the i-th policyholder dies before the maturity, and is 0 otherwise. Therefore, at any time $t\in\left[0,T\right]$ , the future gross liability of the insurer accumulated to the maturity for these N contracts is given by

\begin{equation*}\sum_{i = 1}^{N}\!\left(\left(G_M^{\left(i\right)}-F_T^{\left(i\right)}\right)_+J_T^{\left(i\right)} + \frac{B_T}{B_{T_{x_i}^{\left(i\right)}}}\!\left(G_D^{\left(i\right)}-F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}\right)_+\unicode{x1d7d9}_{\{T_{x_i}^{\left(i\right)} \lt T\}}J_t^{\left(i\right)}\right).\end{equation*}

Denote $V_t^{\textrm{GL}}$ , for $t\in\left[0,T\right]$ , as the time-t value of the discounted (via the risk-free asset B) future gross liability of the insurer; if the liability is 0, the value will be 0.

From the asset-value-based fees collected by the insurer, a portion, known as the rider charge, is used to fund the liability due to the GMMB and GMDB riders; the remaining portion is used to cover overhead, commissions, and any other expenses. From the i-th policyholder, where $i=1,2,\dots,N$ , the insurer collects $m_e^{\left(i\right)}F_t^{\left(i\right)}J_t^{\left(i\right)}$ as the rider charge at any time $t\in\left[0,T\right]$ , where $m_e^{\left(i\right)}\in\!\left(0,m^{\left(i\right)}\right]$ . Therefore, the cumulative future rider charge to be collected, from any time $t\in\left[0,T\right]$ onward, till the maturity, by the insurer from these N policyholders, is given by $\sum_{i=1}^{N}\int_{t}^{T}m_e^{\left(i\right)}F_s^{\left(i\right)}J_s^{\left(i\right)}\!\left(B_T/B_s\right)ds$ . Denote $V_t^{\textrm{RC}}$ , for $t\in\left[0,T\right]$ , as its time-t discounted (via the risk-free asset B) value; if the cumulative rider charge is 0, the value will be 0.

Hence, due to these N variable annuity contracts with both GMMB and GMDB riders, for any $t\in\left[0,T\right]$ , the time-t net liability of the insurer for these N contracts is given by $L_t=V_t^{\textrm{GL}}-V_t^{\textrm{RC}}$ , which is $\mathcal{F}_t$ -measurable.

One of the many ways to set the rate $m^{\left(i\right)}\in\!\left(0,1\right)$ for the asset-value-based fees, and the rate $m_e^{\left(i\right)}\in\!\left(0,m^{\left(i\right)}\right]$ for the rider charge, for $i=1,2,\dots,N$ , is based on the time-0 net liability of the insurer for the i-th policyholder. More precisely, $m^{\left(i\right)}$ and $m_e^{\left(i\right)}$ are determined via $L_0^{\left(i\right)}=V_0^{\textrm{GL},\left(i\right)}-V_0^{\textrm{RC},\left(i\right)}=0$ , where $V_0^{\textrm{GL},\left(i\right)}$ and $V_0^{\textrm{RC},\left(i\right)}$ are the time-0 values of, respectively, the discounted future gross liability and the discounted cumulative future rider charge, of the insurer for the i-th policyholder.
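Determining the rates in this way amounts to a one-dimensional root-finding problem. The sketch below, in which time0_net_liability is a placeholder for the insurer's own valuation routine (e.g. a Monte Carlo estimate of $V_0^{\textrm{GL},\left(i\right)}-V_0^{\textrm{RC},\left(i\right)}$ under their pricing model), finds the rider-charge rate by bisection for a fixed asset-value-based fee rate; it assumes the time-0 net liability is continuous and decreasing in the rider-charge rate.

```python
def solve_rider_charge(time0_net_liability, m, tol=1e-8, max_iter=200):
    """Find m_e in (0, m] with time0_net_liability(m_e) = 0 by bisection.

    `time0_net_liability(m_e)` is a user-supplied (e.g. Monte Carlo) valuation
    of L_0 for a fixed asset-value-based fee rate m; it is assumed continuous
    and decreasing in m_e, positive near 0 and non-positive at m.
    """
    lo, hi = 0.0, m
    if time0_net_liability(hi) > 0.0:
        raise ValueError("L_0(m) > 0: no rider charge in (0, m] prices the riders to zero.")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if time0_net_liability(mid) > 0.0:
            lo = mid   # net liability still positive: the rider charge must increase
        else:
            hi = mid   # net liability non-positive: the rider charge can decrease
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```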

2.1.4. Continuous hedging and hedging objective

The insurer aims to hedge this dual-risk bearing net liability via investing in the financial market. To this end, let $\tilde{T}$ be the death time of the last policyholder; that is, $\tilde{T}=\max_{i=1,2,\dots,N}T_{x_i}^{\left(i\right)}$ , which is random.

While the net liability $L_t$ is defined for any time $t\in\left[0,T\right]$ as the difference between the values of the discounted future gross liability and the discounted cumulative future rider charge, $L_t=0$ for any $t\in\!\left(\tilde{T}\wedge T,T\right]$ . Indeed, if $\tilde{T}<T$ , then, for any $t\in\!\left(\tilde{T}\wedge T,T\right]$ , one has $T_{x_i}^{\left(i\right)}<t\leq T$ for all $i = 1,2,\dots, N$ , and hence the future gross liability accumulated to the maturity and the cumulative rider charge from time $\tilde{T}$ onward are both 0, and so are their values. Therefore, the insurer only hedges the net liability $L_t$ for $t\in\left[0,\tilde{T}\wedge T\right]$ .

Let $H_t$ be the hedging strategy, i.e. the number of shares of the risky asset being held by the insurer, at time $t\in\left[0,T\right)$ . Hence, $H_t=0$ , for any $t\in\left[\tilde{T}\wedge T,T\right)$ . Let $\mathcal{H}$ be the admissible set of hedging strategies, which is defined by

\begin{align*}\mathcal{H}&=\left\{H=\left\{H_t\right\}_{t\in\left[0,T\right)}\;:\;(\textrm{i})\;H\textrm{ is }\mathbb{F}\textrm{-adapted, }(\textrm{ii})\;H\in\mathbb{R},\;\mathbb{P}\times\mathcal{L}\textrm{-a.s., and }\right.\\[5pt] & \qquad\qquad\qquad\qquad\; \left.(\textrm{iii})\;\textrm{for any }t\in\left[\tilde{T}\wedge T,T\right),\;H_t=0\right\},\end{align*}

where $\mathcal{L}$ is the Lebesgue measure on $\mathbb{R}$ . The condition (ii) indicates that there is no constraint on the hedging strategies.

Let $P_t$ be the time-t value, for $t\in\left[0,T\right]$ , of the insurer's hedging portfolio. Then, $P_0=0$ , and together with the rider charges collected from the N policyholders, as well as the withdrawals for paying the liabilities due to the beneficiaries' inheritance from those policyholders who have already died, for any $t\in\!\left(0,T\right]$ ,

\begin{equation*}P_t=\int_{0}^{t}\!\left(P_s-H_sS_s\right)\frac{dB_s}{B_s}+\int_{0}^{t}H_sdS_s+\sum_{i=1}^{N}\int_{0}^{t}m_e^{\left(i\right)}F_s^{\left(i\right)}J_s^{\left(i\right)}ds -\sum_{i=1}^{N}\!\left(G_D^{\left(i\right)}-F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}\right)_+\unicode{x1d7d9}_{\{T_{x_i}^{\left(i\right)}\leq t \lt T\}},\end{equation*}

which obviously depends on $\left\{H_s\right\}_{s\in\left[0,t\right)}$ .

As in Bertsimas et al. (2000), the insurer's hedging objective function at the current time 0 should be given by the root-mean-square error (RMSE) of the terminal profit and loss (P&L), which is, for any $H\in\mathcal{H}$ ,

\begin{equation*}\sqrt{\mathbb{E}^{\mathbb{P}}\left[\!\left(P_{\tilde{T}\wedge T}-L_{\tilde{T}\wedge T}\right)^2\right]}.\end{equation*}

If the insurer has full knowledge of the objective probability measure $\mathbb{P}$ , and hence the correct dynamics of the risk-free asset and the risky asset in the financial market, as well as the correct mortality model in the actuarial market, the optimal hedging strategy, being implemented forwardly, is given by minimising the RMSE of the terminal P&L:

\begin{equation*}H^*=\mathop{\textrm{arg min}}_{H\in\mathcal{H}}\sqrt{\mathbb{E}^{\mathbb{P}}\left[\!\left(P_{\tilde{T}\wedge T}-L_{\tilde{T}\wedge T}\right)^2\right]}.\end{equation*}

2.2. Pitfall of model-based approach

However, having a correct model is usually not the case in practice. Indeed, the insurer, who is the hedging agent above, usually has little information regarding the objective probability measure $\mathbb{P}$ and hence easily misspecifies the financial market dynamics and the mortality model, which will in turn yield poor performance from the supposedly optimal hedging strategy when it is implemented forwardly in the future. Section 2.4 outlines such an illustrative example, which shall be discussed throughout the remainder of this paper.

To rectify this, we propose a two-phase (deep) RL approach to solve for an optimal hedging strategy. In this approach, an RL agent, which is not the insurer themselves but is built by the insurer to hedge on their behalf, does not have any knowledge of the objective probability measure $\mathbb{P}$ , the financial market dynamics, or the mortality model; section 2.5 shall explain this approach in detail. Before that, in the following section 2.3, the classical hedging problem shall first be reformulated as a Markov decision process (MDP) in a discrete time setting so that RL methods can be implemented. The illustrative example outlined in section 2.4 shall be revisited using the proposed two-phase RL approach in sections 4 and 6.

In the remainder of this paper, unless otherwise specified, all expectation operators shall be taken with respect to the objective probability measure $\mathbb{P}$ and denoted simply as $\mathbb{E}\!\left[\cdot\right]$ .

2.3. Discrete and Markov hedging

2.3.1. Discrete hedging and hedging objective

Let $t_0,t_1,\dots,t_{n-1}\in\left[0,T\right)$ , for some $n\in\mathbb{N}$ , be the time when the hedging agent decides the hedging strategy, such that $0=t_0<t_1<\dots<t_{n-1}<T$ . Denote also $t_n=T$ .

Let $t_{\tilde{n}}$ be the first time (right) after the last policyholder dies or all contracts expire, for some $\tilde{n}=1,2,\dots,n$ , which is random; that is, $t_{\tilde{n}}=\min\left\{t_k,\;k=1,2,\dots,n\;:\;t_k\geq\tilde{T}\right\}$ , and when $\tilde{T}>T$ , by convention, $\min\emptyset=t_n$ . Therefore, $H_t=0$ , for any $t=t_{\tilde{n}},t_{\tilde{n}+1},\dots,t_{n-1}$ . With a slight abuse of notation, the admissible set of hedging strategies in discrete time is

\begin{align*}\mathcal{H}=&\;\left\{H=\left\{H_t\right\}_{t=t_0,t_1,\dots,t_{n-1}}:\;(\textrm{i})\;\textrm{for any }t=t_{0},t_{1},\dots,t_{n-1},\;H_t\textrm{ is }\mathcal{F}_t\textrm{-measurable, }\right.\\[5pt] &\;\left.\quad\quad\quad\quad\quad\quad\quad\quad\quad\;\;(\textrm{ii})\;\textrm{for any }t=t_{0},t_{1},\dots,t_{n-1},\;H_t\in\mathbb{R},\;\mathbb{P}\textrm{-a.s., and}\right.\\[5pt] &\;\left.\quad\quad\quad\quad\quad\quad\quad\quad\quad\;\;(\textrm{iii})\;\textrm{for any }t=t_{\tilde{n}},t_{\tilde{n}+1},\dots,t_{n-1},\;H_t=0\right\};\end{align*}

again, the condition (ii) emphasises that no constraint is imposed on the hedging strategies.

While the hedging agent decides the hedging strategy at the discrete time points, the actuarial and financial market models are continuous. Hence, the net liability $L_t=V_t^{\textrm{GL}}-V_t^{\textrm{RC}}$ is still defined for any time $t\in\left[0,T\right]$ as before. Moreover, if $t\in\left[t_k,t_{k+1}\right)$ , for some $k=0,1,\dots,n-1$ , $H_t=H_{t_k}$ ; thus, $P_0=0$ , and, if $t\in\!\left(t_k,t_{k+1}\right]$ , for some $k=0,1,\dots,n-1$ ,

(1) \begin{align}P_t&=\left(P_{t_k}-H_{t_k}S_{t_k}\right)\frac{B_{t}}{B_{t_k}}+H_{t_k}S_{t}+\sum_{i=1}^{N}\int_{t_k}^{t}m_e^{\left(i\right)}F_s^{\left(i\right)}J_s^{\left(i\right)}\frac{B_{t}}{B_{s}}ds \nonumber \\[5pt] & \quad -\sum_{i = 1}^{N}\frac{B_t}{B_{T_{x_i}^{\left(i\right)}}}\!\left(G_D^{\left(i\right)}-F_{T_{x_i}^{\left(i\right)}}^{\left(i\right)}\right)_+\unicode{x1d7d9}_{\{t_k<T_{x_i}^{\left(i\right)}\leq t<T\}}.\end{align}

For any $H\in\mathcal{H}$ , the hedging objective of the insurer at the current time 0 is $\sqrt{\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]}$ . Hence, the optimal discrete hedging strategy, being implemented forwardly, is given by

(2) \begin{equation}H^*=\mathop{\textrm{arg min}}_{H\in\mathcal{H}}\sqrt{\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]}=\mathop{\textrm{arg min}}_{H\in\mathcal{H}}\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right].\end{equation}
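For implementation, the one-period recursion (1) can be coded directly. The sketch below advances the hedging portfolio value over one rebalancing period; the inputs rider_charges and death_benefits stand in for the (already accumulated) integral and sum terms in (1) and would be computed by the caller from the discretised path.

```python
def portfolio_step(P_k, H_k, S_k, S_k1, B_k, B_k1, rider_charges, death_benefits):
    """One step of the discrete hedging portfolio value in equation (1).

    `rider_charges`  : total rider charge m_e * F_s * J_s collected over (t_k, t_{k+1}],
                       accumulated to t_{k+1} with the risk-free asset.
    `death_benefits` : total GMDB payouts (G_D - F_{T_x})_+ for deaths in (t_k, t_{k+1}],
                       accumulated to t_{k+1}.
    Both are computed by the caller from the discretised path.
    """
    bank = (P_k - H_k * S_k) * (B_k1 / B_k)   # risk-free growth of the un-invested cash
    stock = H_k * S_k1                        # value of the risky-asset holding
    return bank + stock + rider_charges - death_benefits
```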

2.3.2. Markov decision process

An MDP can be characterised by its state space, action space, Markov transition probability, and reward signal. In turn, these derive the value function and the optimal value function, which are known as, respectively, the objective function and the value function in optimisation, as in the previous sections. In the remainder of this paper, we shall adopt the MDP language.

  • (State) Let $\mathcal{X}$ be the state space in $\mathbb{R}^p$ , where $p\in\mathbb{N}$ . Each state in the state space represents a possible observation with p features in the actuarial and financial markets. Denote $X_{t_k}\in\mathcal{X}$ as the observed state at any time $t_k$ , where $k=0,1,\dots,n$ ; the state should minimally include information related to the number of surviving policyholders $\sum_{i=1}^{N}J_{t_k}^{\left(i\right)}$ and the term to maturity $T-t_k$ , in order to terminate the hedging at time $t_{\tilde{n}}$ , which is the first time when $\sum_{i=1}^{N}J_{t_{\tilde{n}}}^{\left(i\right)}=0$ or when $T-t_{\tilde{n}}=0$ . The states (space) shall be specified in sections 4 and 5.

  • (Action) Let $\mathcal{A}$ be the action space in $\mathbb{R}$ . Each action in the action space is a possible hedging strategy. Denote $H_{t_k}\!\left(X_{t_k}\right)\in\mathcal{A}$ as the action at any time $t_k$ , where $k=0,1,\dots,n-1$ , which is assumed to be Markovian with respect to the observed state $X_{t_k}$ ; that is, given the current state $X_{t_k}$ , the current action $H_{t_k}\!\left(X_{t_k}\right)$ is independent of the past states $X_{t_0},X_{t_1},\dots,X_{t_{k-1}}$ . In the sequel, for notational simplicity, we simply write $H_{t_k}$ to represent $H_{t_k}\!\left(X_{t_k}\right)$ , for $k=0,1,\dots,n-1$ . If the feature of the number of surviving policyholders $\sum_{i=1}^{N}J_{t_k}^{\left(i\right)}=0$ , for $k=0,1,\dots,n-1$ , in the state $X_{t_k}$ , then $H_{t_k}=0$ ; in particular, for any $t_k$ , where $k=\tilde{n},\tilde{n}+1,\dots,n-1$ , the hedging strategy $H_{t_k}=0$ .

  • (Markov property) At any time $t_k$ , where $k=0,1,\dots,{n-1}$ , given the current state $X_{t_k}$ and the current hedging strategy $H_{t_k}$ , the transition probability distribution of the next state $X_{t_{k+1}}$ in the market is independent of the past states $X_{t_0},X_{t_1},\dots,X_{t_{k-1}}$ and the past hedging strategies $H_{t_0},H_{t_1},\dots,H_{t_{k-1}}$ ; that is, for any Borel set $\overline{B}\in\mathcal{B}\!\left(\mathcal{X}\right)$ ,

    (3) \begin{equation}\mathbb{P}\!\left(X_{t_{k+1}}\in\overline{B}\vert H_{t_{k}},X_{t_{k}},H_{t_{k-1}},X_{t_{k-1}},\dots,H_{t_{1}},X_{t_{1}},H_{t_{0}},X_{t_{0}}\right)=\mathbb{P}\!\left(X_{t_{k+1}}\in\overline{B}\vert H_{t_{k}},X_{t_{k}}\right).\end{equation}
  • (Reward) At any time $t_k$ , where $k=0,1,\dots,{n-1}$ , given the current state $X_{t_k}$ in the market and the current hedging strategy $H_{t_k}$ , a reward signal $R_{t_{k+1}}\!\left(X_{t_k},H_{t_k},X_{t_{k+1}}\right)$ is received, by the hedging agent, as a result of transition to the next state $X_{t_{k+1}}$ . The reward signal shall be specified after introducing the (optimal) value function below. In the sequel, occasionally, for notational simplicity, we simply write $R_{t_{k+1}}$ to represent $R_{t_{k+1}}\!\left(X_{t_k},H_{t_k},X_{t_{k+1}}\right)$ , for $k=0,1,\dots,n-1$ .

  • (State, action, and reward sequence) The states, actions (which are hedging strategies herein), and reward signals form an episode, which is sequentially given by:

    \begin{equation*}\left\{X_{t_0},H_{t_0},X_{t_1},R_{t_1},H_{t_1},X_{t_2},R_{t_2},H_{t_2},\dots,X_{t_{\tilde{n}-1}},R_{t_{\tilde{n}-1}},H_{t_{\tilde{n}-1}},X_{t_{\tilde{n}}},R_{t_{\tilde{n}}}\right\}.\end{equation*}
  • (Optimal value function) Based on the reward signals, the value function, at any time $t_k$ , where $k=0,1,\dots,n-1$ , with the state $x\in\mathcal{X}$ , is defined by, for any hedging strategies $H_{t_k},H_{t_{k+1}},\dots,H_{t_{n-1}}$ ,

    (4) \begin{equation}V\!\left(t_k,x;\;H_{t_k},H_{t_{k+1}},\dots,H_{t_{n-1}}\right)=\mathbb{E}\!\left[\sum_{l=k}^{n-1}\gamma^{t_{l+1}-t_k}R_{t_{l+1}}\Big\vert X_{t_k}=x\right],\end{equation}
    where $\gamma\in\left[0,1\right]$ is the discount rate; the value function, at the time $t_n=T$ with the state $x\in\mathcal{X}$ , is defined by $V\!\left(t_n,x\right)=0$ . Hence, the optimal discrete hedging strategy, being implemented forwardly, is given by
    (5) \begin{equation}H^*=\mathop{\textrm{arg max}}_{H\in\mathcal{H}}\mathbb{E}\!\left[\sum_{k=0}^{n-1}\gamma^{t_{k+1}}R_{t_{k+1}}\Big\vert X_{0}=x\right].\end{equation}
    In turn, the optimal value function, at any time $t_k$ , where $k=0,1,\dots,n-1$ , with the state $x\in\mathcal{X}$ , is
    (6) \begin{equation}V^*\left(t_k,x\right)=V\!\left(t_k,x;\;H^*_{t_k},H^*_{t_{k+1}},\dots,H^*_{t_{n-1}}\right),\textrm{ and }V^*\left(t_n,x\right)=0.\end{equation}
  • (Reward engineering) To ensure that the hedging problem is properly reformulated as an MDP, the value functions, given by that in (5) and the negative of that in (2), should coincide; that is,

    (7) \begin{equation}\mathbb{E}\!\left[\sum_{k=0}^{n-1}\gamma^{t_{k+1}}R_{t_{k+1}}\Big\vert X_{0}=x\right]=-\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right].\end{equation}
    Hence, two possible constructions for the reward signals are proposed as follows; each choice of the reward signals shall be utilised in one of the two phases in the proposed RL approach.

  1. (Single terminal reward) An obvious choice is to only have a reward signal from the negative squared terminal P&L; that is, for any time $t_k$ ,

    (8) \begin{equation}R_{t_{k+1}}=\begin{cases}0, & \textrm{if }k+1<\tilde{n},\\[4pt] -\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2, & \textrm{if }k+1=\tilde{n}.\end{cases}\end{equation}
    Necessarily, the discount rate is given as $\gamma=1$ .
  2. (Sequential anchor-hedging reward) A less obvious choice is obtained by telescoping the right-hand side of Equation (7):

    \begin{equation*}-\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]=-\mathbb{E}\!\left[\sum_{k=0}^{\tilde{n}-1}\!\left(\left(P_{t_{k+1}}-L_{t_{k+1}}\right)^2-\left(P_{t_{k}}-L_{t_{k}}\right)^2\right)+\left(P_0-L_0\right)^2\right].\end{equation*}
    Therefore, when $L_0=P_0$ , another possible construction for the reward signal is, for any time $t_k$ ,
    (9) \begin{equation}R_{t_{k+1}}=\left(P_{t_{k}}-L_{t_{k}}\right)^2-\left(P_{t_{k+1}}-L_{t_{k+1}}\right)^2.\end{equation}
    Again, the discount rate is necessarily given as $\gamma=1$ . The constructed reward in (9) outlines an anchor-hedging scheme. First, note that, at the current time 0, when $L_0=P_0$ , there is no local hedging error. Then, at each future hedging time before the last policyholder dies and before the maturity, the hedging performance is measured by the local squared P&L, i.e. $\left(P_{t_{k}}-L_{t_{k}}\right)^2$ , which serves as an anchor. At the next hedging time, if the local squared P&L is smaller than the anchor, it will be rewarded, i.e. $R_{t_{k+1}}>0$ ; however, if the local squared P&L becomes larger, it will be penalised, i.e. $R_{t_{k+1}}<0$ .
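Both reward constructions can be expressed compactly in code. The minimal sketch below takes the sequences of portfolio values and net liabilities evaluated at the hedging times, indexed by $k=0,1,\dots,\tilde{n}$ , and returns the reward $R_{t_{k+1}}$ under either (8) or (9); the array names are illustrative.

```python
def terminal_reward(k, n_tilde, P, L):
    """Single terminal reward (8): zero until the last hedging step, then the
    negative squared terminal P&L.  P[j] and L[j] are the values at time t_j."""
    if k + 1 < n_tilde:
        return 0.0
    return -(P[n_tilde] - L[n_tilde]) ** 2

def anchor_hedging_reward(k, n_tilde, P, L):
    """Sequential anchor-hedging reward (9): the decrease of the local squared
    P&L between consecutive hedging times, for k = 0, 1, ..., n_tilde - 1."""
    return (P[k] - L[k]) ** 2 - (P[k + 1] - L[k + 1]) ** 2
```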

2.4. Illustrative example

The illustrative example below demonstrates the poor hedging performance by the Delta hedging strategy when the insurer miscalibrates the parameters in the market environment. We consider that the insurer hedges a variable annuity contract, with both GMMB and GMDB riders, of a single policyholder, i.e. $N=1$ , with the contract characteristics given in Table 1.

Table 1. Contract characteristics.

The market environment follows the Black-Scholes (BS) model on the financial side and the constant force of mortality (CFM) model on the actuarial side. The risk-free asset earns a constant risk-free interest rate $r>0$ such that, for any $t\in\left[0,T\right]$ , $dB_t=rB_tdt$ , while the value of the risky asset evolves as a geometric Brownian motion such that, for any $t\in\left[0,T\right]$ , $dS_t=\mu S_tdt+\sigma S_tdW_t$ , where $\mu$ is a constant drift, $\sigma>0$ is a constant volatility, and $W=\left\{W_t\right\}_{t\in\left[0,T\right]}$ is a standard Brownian motion. The random future lifetime of the policyholder $T_x$ has a CFM $\nu>0$ ; that is, for any $0\leq t\leq s\leq T$ , the conditional survival probability is $\mathbb{P}\!\left(T_x>s\vert T_x>t\right)=e^{-\nu\left(s-t\right)}$ . Moreover, the Brownian motion W in the financial market and the future lifetime $T_x$ in the actuarial market are independent. Table 2 summarises the parameters in the market environment. Note that the risk-free interest rate, the risky asset initial price, the initial age of the policyholder, and the investment strategy of the policyholder are observable by the insurer.

Table 2. Parameters setting of market environment.
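A scenario of this market environment is straightforward to simulate. The following sketch draws one path of the BS risky asset on a time grid, the risk-free asset, and a CFM death time; the parameter values in the example call are placeholders rather than those of Table 2.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_market(T, n_steps, S0, mu, sigma, r, nu):
    """Simulate one scenario: BS risky asset, risk-free asset, and CFM death time."""
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    dW = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    # Exact simulation of geometric Brownian motion on the grid.
    S = S0 * np.exp(np.cumsum((mu - 0.5 * sigma**2) * dt + sigma * dW))
    S = np.concatenate(([S0], S))
    B = np.exp(r * t)                   # B_0 = 1
    T_x = rng.exponential(1.0 / nu)     # CFM => exponential future lifetime
    J = (T_x > t).astype(float)         # single-jump survival process J_t
    return t, S, B, T_x, J

# Hypothetical parameters for illustration only.
t, S, B, T_x, J = simulate_market(T=10.0, n_steps=120, S0=100.0,
                                  mu=0.08, sigma=0.2, r=0.02, nu=0.01)
```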

Based on their best knowledge of the market, the insurer builds a model of the market environment. Suppose that the model happens to be the BS and the CFM, as in the market environment, but the insurer miscalibrates the parameters. Table 3 lists the parameters in the model of the market environment. In particular, the risky asset drift and volatility, as well as the constant force of mortality, are different from those in the market environment. The observable parameters are the same as those in the market environment.

Table 3. Parameters setting of model of market environment, with bolded parameters being different from those in market environment.

At any time $t\in\left[0,T\right]$ , the value of the hedging portfolio of the insurer is given by (17), with $N=1$ , in which the values of the risky asset and the single-jump process follow the market environment with the parameters in Table 2. At any time $t\in\left[0,T\right]$ , the value of the net liability of the insurer is given by (16), with $N=1$ , in both the market environment and its model; its detailed derivation is deferred to section 4.1, as the model of the market environment, with multiple homogeneous policyholders for effective training, shall be supplied as the training environment. Since the parameters in the model of the market environment (see Table 3) are different from those in the market environment (see Table 2), the net liability evaluated by the insurer using the model differs from that in the market environment. There are two implications. Firstly, the Delta hedging strategy of the insurer using the parameters in Table 3 is incorrect, while the correct Delta hedging strategy should use the parameters in Table 2. Secondly, the asset-value-based fee m and the rider charge $m_e$ given in Table 4, which are determined by the insurer based on the time-0 value of their net liability under Table 3 via the method in section 2.1.3, are mispriced; they would not lead to a zero time-0 value of the net liability in the market environment, which is based on Table 2.

Table 4. Fee structures derived from model of market environment.

To evaluate the hedging performance of the incorrect Delta strategy by the insurer in the market environment for the variable annuity of contract characteristics in Table 1, 5,000 market scenarios using the parameters in Table 2 are simulated to realise terminal P&Ls. For comparison, the terminal P&Ls by the correct Delta hedging strategy are also obtained. Figure 1 shows the empirical density and cumulative distribution functions of the 5,000 realised terminal P&Ls by each Delta hedging strategy, while Table 5 outlines the summary statistics of the empirical distributions, in which $\widehat{\textrm{RMSE}}$ is the estimated RMSE of the terminal P&L similar to (2).

Table 5. Summary statistics of empirical distributions of realised terminal P&Ls by different Delta strategies.
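The summary statistics reported in Table 5 can be reproduced from the simulated terminal P&Ls along the following lines; the tail-risk conventions (confidence level and sign of the loss) are assumptions here and may differ from those used in the table.

```python
import numpy as np

def pnl_summary(pnl: np.ndarray, alpha: float = 0.95) -> dict:
    """Summary statistics of realised terminal P&Ls, with VaR/TVaR of the loss -P&L."""
    loss = -pnl
    var = np.quantile(loss, alpha)
    tvar = loss[loss >= var].mean()
    return {
        "mean": pnl.mean(),
        "median": np.median(pnl),
        "std": pnl.std(ddof=1),
        "VaR": var,
        "TVaR": tvar,
        "RMSE": np.sqrt(np.mean(pnl ** 2)),   # estimated RMSE of the terminal P&L
    }
```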

Figure 1 Empirical density and cumulative distribution functions of realised terminal P&Ls by different Delta strategies.

In Figure 1(a), the empirical density function of realised terminal P&Ls by the incorrect Delta hedging strategy is more heavy-tailed on the left than that by the correct Delta strategy. In fact, the terminal P&L by the incorrect Delta hedging strategy is stochastically dominated, in the first order, by that by the correct Delta strategy; see Figure 1(b). Table 5 shows that the terminal P&L by the incorrect Delta hedging strategy has a mean and a median farther from zero, a higher standard deviation, larger left-tail risks in terms of Value-at-Risk and Tail Value-at-Risk, and a larger RMSE than that by the correct Delta strategy.

These observations conclude that, even in a market environment as simple as the BS and the CFM, the incorrect Delta hedging strategy based on the miscalibrated parameters by the insurer does not perform well when it is being implemented forwardly. In general, the hedging performance of model-based approaches depends crucially on the calibration of parameters for the model of the market environment.

2.5. Two-phase reinforcement learning approach

In an RL approach, at the current time 0, the insurer builds an RL agent to hedge on their behalf in the future. The agent interacts with a market environment by sequentially observing states, taking, as well as revising, actions, which are the hedging strategies, and collecting rewards. Without possessing any prior knowledge of the market environment, the agent needs to explore the environment while exploiting the collected reward signals for effective learning.

An intuitive proposition would be to allow an infant RL agent to learn directly from such a market environment, like the one in section 2.4, moving forward. However, recall that the insurer does not actually know the exact market dynamics in the environment and thus is not able to provide any theoretical model for the net liability to the RL agent. In turn, the RL agent cannot receive the sequential anchor-hedging reward signal in (9) from the environment, but instead receives the single terminal reward signal in (8). Since the rewards, except the terminal one, are all zero, the infant RL agent would learn ineffectively from such sparse rewards; that is, the RL agent would take a tremendous amount of time to finally learn a nearly optimal hedging strategy in the environment. Most importantly, while the RL agent is exploring and learning from the environment, which is not a simulated one, the insurer could suffer a huge financial burden due to any sub-optimal hedging performance.

In view of this, we propose that the insurer should first designate the infant RL agent to interact with and learn from a training environment, which is constructed by the insurer based on their best knowledge of the market, for example, the model of the market environment in section 2.4. Since the training environment is known to the insurer (but is unknown to the RL agent), the RL agent can be supplied with a theoretical model of the net liability and can consequently learn from the sequential anchor-hedging reward signal in (9) of the training environment; a skeleton of such a training environment is sketched after the list below. Therefore, the infant RL agent would be guided by the net liability to learn effectively from the local hedging errors. After interacting with and learning from the training environment for a period of time, in order to gauge the effectiveness, the RL agent shall be tested for its hedging performance on simulated scenarios from the same training environment. This first phase is called the training phase.

Training Phase:

  1. (i) The insurer constructs the MDP training environment.

  2. (ii) The insurer builds the infant RL agent which uses the PPO algorithm.

  3. (iii) The insurer assigns the RL agent in the MDP training environment to interact and learn for a period of time, during which the RL agent collects the anchor-hedging reward signal in (9).

  4. (iv) The insurer deploys the trained RL agent to hedge in simulated scenarios from the same training environment and documents the baseline hedging performance.
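A skeleton of the MDP training environment in steps (i) and (iii) is sketched below. The callables initial_state, transition, and net_liability are placeholders for the insurer's model of the market environment (for example, the BS and CFM model with multiple homogeneous policyholders of section 4.1); since the insurer knows this model, the net liability can be evaluated at every hedging time and the anchor-hedging reward (9) is returned at each step.

```python
class TrainingEnvironment:
    """Skeleton of the MDP training environment used in the training phase."""

    def __init__(self, initial_state, transition, net_liability):
        # `initial_state()` returns X_{t_0};
        # `transition(state, P, action)` returns (next_state, next_P, done),
        #   where the portfolio value P is updated as in equation (1);
        # `net_liability(state)` returns L at the corresponding hedging time.
        self.initial_state = initial_state
        self.transition = transition
        self.net_liability = net_liability

    def reset(self):
        self.state = self.initial_state()
        self.P = 0.0                 # P_0 = 0, and L_0 = P_0 by the fee structure
        self.prev_sq_pnl = 0.0       # anchor (P_0 - L_0)^2
        return self.state

    def step(self, action):
        next_state, next_P, done = self.transition(self.state, self.P, action)
        sq_pnl = (next_P - self.net_liability(next_state)) ** 2
        reward = self.prev_sq_pnl - sq_pnl          # anchor-hedging reward (9)
        self.state, self.P, self.prev_sq_pnl = next_state, next_P, sq_pnl
        return next_state, reward, done
```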

If the hedging performance of the trained RL agent in the training environment is satisfactory, the insurer should then proceed to assign it to interact with and learn from the market environment. Since the training and market environments are usually different, for example having different parameters as in section 2.4, the initial hedging performance of the trained RL agent in the market environment is expected to diverge from the baseline hedging performance attained in the training environment. However, different from an infant RL agent, the trained RL agent is experienced, so that the sparse reward signal in (8) should be sufficient for the agent to revise the hedging strategy, from the nearly optimal one in the training environment to that in the market environment, within a reasonable amount of time. This second phase is called the online learning phase.

Online Learning Phase:

  1. (v) The insurer assigns the RL agent in the market environment to interact and learn in real time, during which the RL agent collects the single terminal reward signal in (8).

These summarise the proposed two-phase RL approach. Figure 2 depicts the above sequence clearly. There are several assumptions underneath this two-phase RL approach in order to apply it effectively to a hedging problem of a contingent claim; as they involve specifics in later sections, we collate their discussions and elaborate their implications in practice in section 7. In the following section, we shall briefly review the training essentials of RL in order to introduce the PPO algorithm. For the details of online learning phase, we defer them until section 5.

Figure 2 The relationship among insurer, RL agent, MDP training environment, and market environment of the two-phase RL approach.

3. Review of Reinforcement Learning

3.1. Stochastic action for exploration

One of the fundamental ideas in RL is that, at any time $t_k$ , where $k=0,1,\dots,{n-1}$ , given the current state $X_{t_k}$ , the RL agent does not take a deterministic action $H_{t_k}$ but extends it to a stochastic action, in order to explore the MDP environment and in turn learn from the reward signals. The stochastic action is sampled through a so-called policy, which is defined below.

Let $\mathcal{P}\!\left(\mathcal{A}\right)$ be a set of probability measures over the action space $\mathcal{A}$ ; each probability measure $\mu\left(\cdot\right)\in\mathcal{P}\!\left(\mathcal{A}\right)$ maps a Borel set $\overline{A}\in\mathcal{B}\!\left(\mathcal{A}\right)$ to $\mu\left(\overline{A}\right)\in\left[0,1\right]$ . The policy $\pi\!\left(\cdot\right)$ is a mapping from the state space $\mathcal{X}$ to the set of probability measures $\mathcal{P}\!\left(\mathcal{A}\right)$ ; that is, for any state $x\in\mathcal{X}$ , $\pi\!\left(x\right)=\mu\left(\cdot\right)\in\mathcal{P}\!\left(\mathcal{A}\right)$ . The value function and the optimal value function, at any time $t_k$ , where $k=0,1,\dots,\tilde{n}-1$ , with the state $x\in\mathcal{X}$ , are then generalised as, for any policy $\pi\!\left(\cdot\right)$ ,

(10) \begin{equation}V\!\left(t_k,x;\;\pi\!\left(\cdot\right)\right)=\mathbb{E}\!\left[\sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}\Big\vert X_{t_k}=x\right],\quad V^*\left(t_k,x\right)=\sup_{\pi\!\left(\cdot\right)}V\!\left(t_k,x;\;\pi\!\left(\cdot\right)\right);\end{equation}

at any time $t_k$ , where $k=\tilde{n},\tilde{n}+1,\dots,n-1$ , with the state $x\in\mathcal{X}$ , for any policy $\pi\!\left(\cdot\right)$ , $V\!\left(t_k,x;\;\pi\!\left(\cdot\right)\right)=V^*\left(t_k,x\right)=0$ . In particular, if $\mathcal{P}\!\left(\mathcal{A}\right)$ contains only all Dirac measures over the action space $\mathcal{A}$ , which is the case in the DH approach of Bühler et al. (2019) (see Appendix A for more details), the value function and the optimal value function reduce to (4) and (6). With this relaxed setting, solving the optimal hedging strategy $H^*$ boils down to finding the optimal policy $\pi^*\!\left(\cdot\right)$ .

3.2. Policy approximation and parameterisation

As the hedging problem has the infinite action space $\mathcal{A}$ , tabular solution methods for problems of finite state space and finite action space (such as Q-learning), or value function approximation methods for problems of infinite state space and finite action space (such as deep Q-learning) are not suitable. Instead, a policy gradient method is employed.

To this end, the policy $\pi\!\left(\cdot\right)$ is approximated and parametrised by the weights $\theta_{\textrm{p}}$ of an artificial neural network (ANN); in turn, denote the policy by $\pi\!\left(\cdot;\;\theta_{\textrm{p}}\right)$ . The ANN $\mathcal{N}_{\textrm{p}}\!\left(\cdot;\;\theta_{\textrm{p}}\right)$ (to be defined in (11) below) takes a state $x\in\mathcal{X}$ as the input vector and outputs the parameters of a probability measure in $\mathcal{P}\!\left(\mathcal{A}\right)$ . In the sequel, the set $\mathcal{P}\!\left(\mathcal{A}\right)$ contains all Gaussian measures (see, for example, Wang et al. 2020 and Wang & Zhou 2020), each of which has a mean c and a variance $d^2$ that depend on the state input $x\in\mathcal{X}$ and the ANN weights $\theta_{\textrm{p}}$ . Therefore, for any state $x\in\mathcal{X}$ ,

\begin{equation*}\pi\!\left(x;\;\theta_{\textrm{p}}\right)=\mu\left(\cdot;\;\theta_{\textrm{p}}\right)\sim\textrm{Gaussian}\!\left(c\!\left(x;\;\theta_{\textrm{p}}\right),d^2\left(x;\;\theta_{\textrm{p}}\right)\right),\end{equation*}

where $\left(c\!\left(x;\;\theta_{\textrm{p}}\right),d^2\!\left(x;\;\theta_{\textrm{p}}\right)\right)=\mathcal{N}_{\textrm{p}}\!\left(x;\;\theta_{\textrm{p}}\right)$ .
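Sampling a stochastic action from this Gaussian policy is then a two-step procedure: evaluate the policy network at the current state and draw from the resulting normal distribution. A minimal sketch, in which toy_policy is a stand-in for $\mathcal{N}_{\textrm{p}}\!\left(\cdot;\;\theta_{\textrm{p}}\right)$ , is as follows.

```python
import numpy as np

def sample_action(policy_network, x, rng):
    """Draw H ~ Gaussian(c(x; theta_p), d^2(x; theta_p)), where
    `policy_network(x)` returns the pair (c, d2) with d2 > 0."""
    c, d2 = policy_network(x)
    return rng.normal(loc=c, scale=np.sqrt(d2))

# Toy stand-in for the policy network: constant mean 0.5 and variance 0.01.
toy_policy = lambda x: (0.5, 0.01)
action = sample_action(toy_policy, x=np.zeros(4), rng=np.random.default_rng(0))
```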

With such approximation and parameterisation, solving for the optimal policy $\pi^*$ further boils down to finding the optimal ANN weights $\theta^*_{\textrm{p}}$ . Hence, denote the value function and the optimal value function in (10) by $V\!\left(t_k,x;\;\theta_{\textrm{p}}\right)$ and $V\!\left(t_k,x;\;\theta^*_{\textrm{p}}\right)$ , for any $t_k$ , where $k=0,1,\dots,\tilde{n}-1$ , with $x\in\mathcal{X}$ . However, the (optimal) value function still depends on the objective probability measure $\mathbb{P}$ , the financial market dynamics, and the mortality model, which are unknown to the RL agent. Before formally introducing the policy gradient methods to tackle this issue, we shall first explicitly construct the ANNs for the approximated policy, as well as for an estimate of the value function (in preparation for the policy gradient algorithm reviewed below).

3.3. Network architecture

As alluded above, in this paper, the ANN involves two parts, which are the policy network and the value function network.

3.3.1. Policy network

Let $N_{\textrm{p}}$ be the number of layers for the policy network. For $l=0,1,\dots,N_{\textrm{p}}$ , let $d_{\textrm{p}}^{\left(l\right)}$ be the dimension of the l-th layer, where the 0-th layer is the input layer; the $1,2,\dots,\left(N_{\textrm{p}}-1\right)$ -th layers are hidden layers; the $N_{\textrm{p}}$ -th layer is the output layer. In particular, $d_{\textrm{p}}^{\left(0\right)}=p$ , which is the number of features in the actuarial and financial parts, and $d_{\textrm{p}}^{\left(N_{\textrm{p}}\right)}=2$ , which outputs the mean c and the variance $d^2$ of the Gaussian measure. The policy network $\mathcal{N}_{\textrm{p}}\;:\;\mathbb{R}^{p}\rightarrow\mathbb{R}^{2}$ is defined as, for any $x\in\mathbb{R}^{p}$ ,

(11) \begin{equation}\mathcal{N}_{\textrm{p}}\!\left(x\right)=\left(W_{\textrm{p}}^{\left(N_{\textrm{p}}\right)}\circ\psi\circ W_{\textrm{p}}^{\left(N_{\textrm{p}}-1\right)}\circ\psi\circ W_{\textrm{p}}^{\left(N_{\textrm{p}}-2\right)}\circ\dots\circ\psi\circ W_{\textrm{p}}^{\left(1\right)}\right)\left(x\right),\end{equation}

where, for $l=1,2,\dots,N_{\textrm{p}}$ , the mapping $W_{\textrm{p}}^{\left(l\right)}\;:\;\mathbb{R}^{d_{\textrm{p}}^{\left(l-1\right)}}\rightarrow\mathbb{R}^{d_{\textrm{p}}^{\left(l\right)}}$ is affine, and the mapping $\psi\;:\;\mathbb{R}^{d_{\textrm{p}}^{\left(l\right)}}\rightarrow\mathbb{R}^{d_{\textrm{p}}^{\left(l\right)}}$ is a componentwise activation function. Let $\theta_{\textrm{p}}$ be the parameter vector of the policy network and in turn denote the policy network in (11) by $\mathcal{N}_{\textrm{p}}\!\left(x;\;\theta_{\textrm{p}}\right)$ , for any $x\in\mathbb{R}^p$ .

3.3.2. Value function network

The value function network is constructed similarly to the policy network, except that all subscripts p (policy) are replaced by v (value). In particular, the value function network $\mathcal{N}_{\textrm{v}}\;:\;\mathbb{R}^{p}\rightarrow\mathbb{R}$ is defined as, for any $x\in\mathbb{R}^{p}$,

(12) \begin{equation}\mathcal{N}_{\textrm{v}}\!\left(x\right)=\left(W_{\textrm{v}}^{\left(N_{\textrm{v}}\right)}\circ\psi\circ W_{\textrm{v}}^{\left(N_{\textrm{v}}-1\right)}\circ\psi\circ W_{\textrm{v}}^{\left(N_{\textrm{v}}-2\right)}\circ\dots\circ\psi\circ W_{\textrm{v}}^{\left(1\right)}\right)\left(x\right),\end{equation}

which models an approximated value function $\hat{V}$ (see section 3.4 below). Let $\theta_{\textrm{v}}$ be the parameter vector of the value function network and in turn denote the value function network in (12) by $\mathcal{N}_{\textrm{v}}\!\left(x;\;\theta_{\textrm{v}}\right)$ , for any $x\in\mathbb{R}^p$ .

3.3.3. Shared layers structure

Since the policy and value function networks should extract features from the input state vector in a similar manner, they are assumed to share the first few layers. More specifically, let $N_{\textrm{s}}\!\left(<\min\left\{N_{\textrm{p}},N_{\textrm{v}}\right\}\right)$ be the number of shared layers for the policy and value function networks; for $l=1,2,\dots,N_{\textrm{s}}$ , $W_{\textrm{p}}^{\left(l\right)}=W_{\textrm{v}}^{\left(l\right)}=W_{\textrm{s}}^{\left(l\right)}$ , and hence, for any $x\in\mathbb{R}^{p}$ ,

\begin{equation*}\mathcal{N}_{\textrm{p}}\!\left(x;\;\theta_{\textrm{p}}\right)=\left(W_{\textrm{p}}^{\left(N_{\textrm{p}}\right)}\circ\psi\circ W_{\textrm{p}}^{\left(N_{\textrm{p}}-1\right)}\circ\dots\circ\psi\circ W_{\textrm{p}}^{\left(N_{\textrm{s}}+1\right)}\circ\psi\circ W_{\textrm{s}}^{\left(N_{\textrm{s}}\right)}\circ\dots\circ\psi\circ W_{\textrm{s}}^{\left(1\right)}\right)\left(x\right),\end{equation*}
\begin{equation*}\mathcal{N}_{\textrm{v}}\!\left(x;\;\theta_{\textrm{v}}\right)=\left(W_{\textrm{v}}^{\left(N_{\textrm{v}}\right)}\circ\psi\circ W_{\textrm{v}}^{\left(N_{\textrm{v}}-1\right)}\circ\dots\circ\psi\circ W_{\textrm{v}}^{\left(N_{\textrm{s}}+1\right)}\circ\psi\circ W_{\textrm{s}}^{\left(N_{\textrm{s}}\right)}\circ\dots\circ\psi\circ W_{\textrm{s}}^{\left(1\right)}\right)\left(x\right).\end{equation*}

Let $\theta$ be the parameter vector of the policy and value function networks. Figure 3 depicts such a shared layers structure.

Figure 3 An example of policy and value function artificial neural networks with a shared hidden layer and a non-shared hidden layer.
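As an illustration only, the following PyTorch sketch builds a pair of networks of the form (11)–(12) with shared first layers; the layer widths, the softplus transform keeping the variance output positive, and all names are assumptions of this sketch rather than the exact architecture reported in Table 6(b).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActorCritic(nn.Module):
    """Policy and value function networks sharing the first layers, as in
    section 3.3.3; widths and the positivity transform for the variance
    are assumptions of this sketch, not the paper's exact specification."""

    def __init__(self, p=4, hidden=64, n_shared=1, n_extra=1):
        super().__init__()
        shared, d_in = [], p
        for _ in range(n_shared):                       # W_s^(1), ..., W_s^(N_s), each followed by ReLU
            shared += [nn.Linear(d_in, hidden), nn.ReLU()]
            d_in = hidden
        self.shared = nn.Sequential(*shared)

        def head(d_out):                                # non-shared layers and output layer
            layers = []
            for _ in range(n_extra):
                layers += [nn.Linear(hidden, hidden), nn.ReLU()]
            layers.append(nn.Linear(hidden, d_out))
            return nn.Sequential(*layers)

        self.policy_head = head(2)                      # outputs (c, raw variance)
        self.value_head = head(1)                       # outputs the estimate V-hat

    def forward(self, x):
        z = self.shared(x)
        c, raw = self.policy_head(z).unbind(dim=-1)
        d2 = F.softplus(raw) + 1e-6                     # keep the variance strictly positive
        v = self.value_head(z).squeeze(-1)
        return c, d2, v
```

A batch of state vectors of shape (batch, p) then yields the Gaussian parameters and the value estimate in a single forward pass, with the shared layers being trained by both heads.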

3.4. Proximal policy optimisation: a temporal-difference policy gradient method

In a policy gradient method, starting from initial ANN weights $\theta^{\left(0\right)}$, the RL agent interacts with the MDP environment to observe states and collect reward signals, and gradually updates the ANN weights by (stochastic) gradient ascent on a surrogate performance measure defined on the ANN weights. That is, at each update step $u=1,2,\dots$,

(13) \begin{equation}\theta^{\left(u\right)}=\theta^{\left(u-1\right)}+\alpha\widehat{\nabla_{\theta}\mathcal{J}^{\left(u-1\right)}\!\left(\theta^{\left(u-1\right)}\right)},\end{equation}

where the hyperparameter $\alpha\in\left[0,1\right]$ is the learning rate of the RL agent, and, based on the experienced episode(s), $\widehat{\nabla_{\theta}\mathcal{J}^{\left(u-1\right)}\!\left(\theta^{\left(u-1\right)}\right)}$ is the estimated gradient of the surrogate performance measure $\mathcal{J}^{\left(u-1\right)}\!\left(\cdot\right)$ evaluated at $\theta=\theta^{\left(u-1\right)}$.
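In practice, automatic-differentiation libraries implement the ascent step (13) as a descent step on the negated surrogate; a minimal PyTorch sketch, assuming that a scalar tensor `surrogate` estimating $\mathcal{J}^{\left(u-1\right)}$ has already been built from the experienced batch:

```python
import torch

def ascent_step(net, surrogate, alpha):
    """One update of (13): theta <- theta + alpha * (estimated gradient of J).

    `net` holds the ANN weights theta; `surrogate` is a scalar tensor
    estimating J^{(u-1)}(theta) from the current batch (how this estimate
    is formed is an assumption about the surrounding code).
    """
    optimiser = torch.optim.SGD(net.parameters(), lr=alpha)
    optimiser.zero_grad()
    (-surrogate).backward()     # descending on -J is ascending on J
    optimiser.step()
```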

REINFORCE, pioneered by Williams (Reference Williams1992), is a Monte Carlo policy gradient method that updates the ANN weights episode by episode. As this paper applies a temporal-difference (TD) policy gradient method, we relegate the review of REINFORCE to Appendix B, where the Policy Gradient Theorem, the foundation of any policy gradient method, is presented.

PPO, pioneered by Schulman et al. (Reference Schulman, Wolski, Dhariwal, Radford and Klimov2017), is a TD policy gradient method that updates the ANN weights using a batch of $K\in\mathbb{N}$ realisations. At each update step $u=1,2,\dots$, based on the ANN weights $\theta^{\left(u-1\right)}$, and thus the policy $\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$, the RL agent experiences $E^{\left(u\right)}\in\mathbb{N}$ realised episodes to obtain the K realisations.

  • If $E^{\left(u\right)}=1$ , the episode is given by

    \begin{align*}&\;\left\{\dots,x_{t_{K_s^{\left(u\right)}}}^{\left(u-1\right)},h_{t_{K_s^{\left(u\right)}}}^{\left(u-1\right)},x_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1\right)},r_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1\right)},h_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{t_{K_s^{\left(u\right)}+K-1}}^{\left(u-1\right)},r_{t_{K_s^{\left(u\right)}+K-1}}^{\left(u-1\right)},h_{t_{K_s^{\left(u\right)}+K-1}}^{\left(u-1\right)},x_{t_{K_s^{\left(u\right)}+K}}^{\left(u-1\right)},r_{t_{K_s^{\left(u\right)}+K}}^{\left(u-1\right)},\dots\right\},\end{align*}
    where $K_s^{\left(u\right)}=0,1,\dots,\tilde{n}-1$, such that the time $t_{K_s^{\left(u\right)}}$ is when the episode is initiated in this update, and $h_{t_k}^{\left(u-1\right)}$, for $k=0,1,\dots,\tilde{n}-1$, is the realised hedging strategy at time $t_k$, sampled from the Gaussian distribution with mean $c\!\left(x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$ and variance $d^2\left(x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$; necessarily, $\tilde{n}-K_s^{\left(u\right)}\geq K$.
  • If $E^{\left(u\right)}=2,3,\dots$ , the episodes are given by

    \begin{align*}&\;\left\{\dots,x_{t_{K_s^{\left(u\right)}}}^{\left(u-1,1\right)},h_{t_{K_s^{\left(u\right)}}}^{\left(u-1,1\right)},x_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1,1\right)},r_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1,1\right)},h_{t_{K_s^{\left(u\right)}+1}}^{\left(u-1,1\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{t_{\tilde{n}^{\left(1\right)}-1}}^{\left(u-1,1\right)},r_{t_{\tilde{n}^{\left(1\right)}-1}}^{\left(u-1,1\right)},h_{t_{\tilde{n}^{\left(1\right)}-1}}^{\left(u-1,1\right)},x_{t_{\tilde{n}^{\left(1\right)}}}^{\left(u-1,1\right)},r_{t_{\tilde{n}^{\left(1\right)}}}^{\left(u-1,1\right)}\right\},\\[5pt] &\;\left\{x_{t_0}^{\left(u-1,2\right)},h_{t_0}^{\left(u-1,2\right)},x_{t_1}^{\left(u-1,2\right)},r_{t_1}^{\left(u-1,2\right)},h_{t_1}^{\left(u-1,2\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{t_{\tilde{n}^{\left(2\right)}-1}}^{\left(u-1,2\right)},r_{t_{\tilde{n}^{\left(2\right)}-1}}^{\left(u-1,2\right)},h_{t_{\tilde{n}^{\left(2\right)}-1}}^{\left(u-1,2\right)},x_{t_{\tilde{n}^{\left(2\right)}}}^{\left(u-1,2\right)},r_{t_{\tilde{n}^{\left(2\right)}}}^{\left(u-1,2\right)}\right\},\\[5pt] &\;\dots,\end{align*}
    \begin{align*}&\;\left\{x_{t_0}^{\left(u-1,E^{\left(u\right)}-1\right)},h_{t_0}^{\left(u-1,E^{\left(u\right)}-1\right)},x_{t_1}^{\left(u-1,E^{\left(u\right)}-1\right)},r_{t_1}^{\left(u-1,E^{\left(u\right)}-1\right)},h_{t_1}^{\left(u-1,E^{\left(u\right)}-1\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{t_{\tilde{n}^{\left(E^{\left(u\right)}-1\right)}-1}}^{\left(u-1,E^{\left(u\right)}-1\right)},r_{t_{\tilde{n}^{\left(E^{\left(u\right)}-1\right)}-1}}^{\left(u-1,E^{\left(u\right)}-1\right)},h_{t_{\tilde{n}^{\left(E^{\left(u\right)}-1\right)}-1}}^{\left(u-1,E^{\left(u\right)}-1\right)},x_{t_{\tilde{n}^{\left(E^{\left(u\right)}-1\right)}}}^{\left(u-1,E^{\left(u\right)}-1\right)},r_{t_{\tilde{n}^{\left(E^{\left(u\right)}-1\right)}}}^{\left(u-1,E^{\left(u\right)}-1\right)}\right\},\\[5pt] &\;\left\{x_{t_0}^{\left(u-1,E^{\left(u\right)}\right)},h_{t_0}^{\left(u-1,E^{\left(u\right)}\right)},x_{t_1}^{\left(u-1,E^{\left(u\right)}\right)},r_{t_1}^{\left(u-1,E^{\left(u\right)}\right)},h_{t_1}^{\left(u-1,E^{\left(u\right)}\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{t_{K_f^{\left(u\right)}-1}}^{\left(u-1,E^{\left(u\right)}\right)},r_{t_{K_f^{\left(u\right)}-1}}^{\left(u-1,E^{\left(u\right)}\right)},h_{t_{K_f^{\left(u\right)}-1}}^{\left(u-1,E^{\left(u\right)}\right)},x_{t_{K_f^{\left(u\right)}}}^{\left(u-1,E^{\left(u\right)}\right)},r_{t_{K_f^{\left(u\right)}}}^{\left(u-1,E^{\left(u\right)}\right)},\dots\right\},\end{align*}
    where $K_f^{\left(u\right)}=1,2,\dots,\tilde{n}^{\left(E^{\left(u\right)}\right)}$ , such that the time $t_{K_f^{\left(u\right)}}$ is when the last episode is finished (but not necessarily terminated) in this update; necessarily, $\tilde{n}^{\left(1\right)}-K_s^{\left(u\right)}+\sum_{e=2}^{E^{\left(u\right)}-1}\tilde{n}^{\left(e\right)}+K_f^{\left(u\right)}=K$ .
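Such a batch collection, which rolls over an episode boundary whenever a contract terminates, can be sketched as follows; the gym-style interface `env.reset()`/`env.step()` and the sampling routine `sample_action` are hypothetical wrappers around the MDP training environment and the current policy, not the paper's implementation.

```python
def collect_batch(env, sample_action, K, x):
    """Collect K transitions under the policy pi(.; theta_p^(u-1)),
    continuing across episode boundaries as described above.

    `env` is a hypothetical gym-style wrapper of the MDP training
    environment, and `sample_action` draws a hedging action (and its
    log-density) from the current Gaussian policy for a given state.
    """
    batch = []
    for _ in range(K):
        h, log_phi = sample_action(x)
        x_next, r, done = env.step(h)                 # next state and reward signal
        batch.append((x, h, log_phi, r, x_next, done))
        x = env.reset() if done else x_next           # re-initiate when an episode terminates
    return batch, x                                   # carry x over to the next update step
```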

The surrogate performance measure of PPO consists of three components. In the following, fix an update step $u=1,2,\dots$ .

Inspired by Schulman et al. (Reference Schulman, Levine, Moritz, Jordan and Abbeel2015), in which the difference between the time-0 value functions of two policies is shown to equal an expected advantage, and combined with importance sampling and a reformulation of the KL-divergence constraint, the first component of the surrogate performance measure of PPO is given by:

  • if $E^{\left(u\right)}=1$ ,

    \begin{equation*}L_{\textrm{CLIP}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{K_s^{\left(u\right)}+K-1}\min\left\{q_{t_k}^{\left(u-1\right)}\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k},\textrm{clip}\!\left(q_{t_k}^{\left(u-1\right)},1-\epsilon,1+\epsilon\right)\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}\right\}\right],\end{equation*}
    where the importance sampling ratio $q_{t_k}^{\left(u-1\right)}=\frac{\phi\left(H^{\left(u-1\right)}_{t_k};\;X^{\left(u-1\right)}_{t_k},\theta_{\textrm{p}}\right)}{\phi\left(H^{\left(u-1\right)}_{t_k};\;X^{\left(u-1\right)}_{t_k},\theta^{\left(u-1\right)}_{\textrm{p}}\right)}$ , in which $\phi\left(\cdot;\;X^{\left(u-1\right)}_{t_k},\theta_{\textrm{p}}\right)$ is the Gaussian density function with mean $c\!\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)$ and variance $d^2\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)$ , the estimated advantage is evaluated at $\theta_{\textrm{p}}=\theta^{\left(u-1\right)}_{\textrm{p}}$ and bootstrapped through the approximated value function that
    \begin{equation*}\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}=\begin{cases}\sum_{l=k}^{K_s^{\left(u\right)}+K-1}R_{t_{l+1}}^{\left(u-1\right)}+\hat{V}\!\left(t_{K_s^{\left(u\right)}+K},X^{\left(u-1\right)}_{t_{K_s^{\left(u\right)}+K}};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)&\\[5pt] \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\;-\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)&\textrm{if }K_s^{\left(u\right)}+K<\tilde{n},\\[5pt] \sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}^{\left(u-1\right)}-\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)&\textrm{if }K_s^{\left(u\right)}+K=\tilde{n},\\[5pt] \end{cases}\end{equation*}
    and the function $\textrm{clip}\!\left(q_{t_k}^{\left(u-1\right)},1-\epsilon,1+\epsilon\right)=\min\left\{\max\left\{q_{t_k}^{\left(u-1\right)},1-\epsilon\right\},1+\epsilon\right\}$ . The approximated value function $\hat{V}$ is given by the output of the value network, i.e. $\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)=\mathcal{N}_{\textrm{v}}\!\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)$ as defined in (12) for $k=0,1,\dots,\tilde{n}-1$ .
  • if $E^{\left(u\right)}=2,3,\dots$ ,

    \begin{align*}&\;L_{\textrm{CLIP}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{\tilde{n}^{\left(1\right)}-1}\min\left\{q_{t_k}^{\left(u-1,1\right)}\hat{A}^{\left(u-1,1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k},\textrm{clip}\!\left(q_{t_k}^{\left(u-1,1\right)},1-\epsilon,1+\epsilon\right)\hat{A}^{\left(u-1,1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}\right\}\right.\\[5pt] &\;\left.+\sum_{e=2}^{E^{\left(u\right)}-1}\sum_{k=0}^{\tilde{n}^{\left(e\right)}-1}\min\left\{q_{t_k}^{\left(u-1,e\right)}\hat{A}^{\left(u-1,e\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k},\textrm{clip}\!\left(q_{t_k}^{\left(u-1,e\right)},1-\epsilon,1+\epsilon\right)\hat{A}^{\left(u-1,e\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}\right\}\right.\\[5pt] &\;\left.+\sum_{k=0}^{K_f^{\left(u\right)}-1}\min\left\{q_{t_k}^{\left(u-1,E^{\left(u\right)}\right)}\hat{A}^{\left(u-1,E^{\left(u\right)}\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k},\textrm{clip}\!\left(q_{t_k}^{\left(u-1,E^{\left(u\right)}\right)},1-\epsilon,1+\epsilon\right)\hat{A}^{\left(u-1,E^{\left(u\right)}\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}\right\}\right].\end{align*}

Similar to REINFORCE in Appendix B, the second component in the surrogate performance measure of PPO minimises the loss between the bootstrapped sum of reward signals and the approximated value function. To this end, define:

  • if $E^{\left(u\right)}=1$ ,

    \begin{equation*}L_{\textrm{VF}}^{\left(u-1\right)}\!\left(\theta_{\textrm{v}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{K_s^{\left(u\right)}+K-1}\!\left(\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}+\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)-\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}\right)\right)^2\right];\end{equation*}
  • if $E^{\left(u\right)}=2,3,\dots$ ,

    \begin{align*}&\;L_{\textrm{VF}}^{\left(u-1\right)}\!\left(\theta_{\textrm{v}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{\tilde{n}^{\left(1\right)}-1}\!\left(\hat{A}^{\left(u-1,1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}+\hat{V}\!\left(t_k,X^{\left(u-1,1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)-\hat{V}\!\left(t_k,X^{\left(u-1,1\right)}_{t_k};\;\theta_{\textrm{v}}\right)\right)^2\right.\\[5pt] &\;\left.+\sum_{e=2}^{E^{\left(u\right)}-1}\sum_{k=0}^{\tilde{n}^{\left(e\right)}-1}\!\left(\hat{A}^{\left(u-1,e\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}+\hat{V}\!\left(t_k,X^{\left(u-1,e\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)-\hat{V}\!\left(t_k,X^{\left(u-1,e\right)}_{t_k};\;\theta_{\textrm{v}}\right)\right)^2\right.\\[5pt] &\;\left.+\sum_{k=0}^{K_f^{\left(u\right)}-1}\!\left(\hat{A}^{\left(u-1,E^{\left(u\right)}\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}+\hat{V}\!\left(t_k,X^{\left(u-1,E^{\left(u\right)}\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)-\hat{V}\!\left(t_k,X^{\left(u-1,E^{\left(u\right)}\right)}_{t_k};\;\theta_{\textrm{v}}\right)\right)^2\right].\end{align*}

Finally, to encourage the RL agent to explore the MDP environment, the third component of the surrogate performance measure of PPO is the entropy bonus. Based on the Gaussian density function, define

  • if $E^{\left(u\right)}=1$ ,

    \begin{equation*}L_{\textrm{EN}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{K_s^{\left(u\right)}+K-1}\ln d\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)\right];\end{equation*}
  • if $E^{\left(u\right)}=2,3,\dots$ ,

    \begin{align*}&\;L_{\textrm{EN}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=K_s^{\left(u\right)}}^{\tilde{n}^{\left(1\right)}-1}\ln d\left(X^{\left(u-1,1\right)}_{t_k};\;\theta_{\textrm{p}}\right)+\sum_{e=2}^{E^{\left(u\right)}-1}\sum_{k=0}^{\tilde{n}^{\left(e\right)}-1}\ln d\left(X^{\left(u-1,e\right)}_{t_k};\;\theta_{\textrm{p}}\right)\right.\\[5pt] &\;\left.\quad\quad\quad\quad\quad\quad\quad\quad+\sum_{k=0}^{K_f^{\left(u\right)}-1}\ln d\left(X^{\left(u-1,E^{\left(u\right)}\right)}_{t_k};\;\theta_{\textrm{p}}\right)\right].\end{align*}
Therefore, the surrogate performance measure of PPO is given by:

(14) \begin{equation}\mathcal{J}^{\left(u-1\right)}\!\left(\theta\right)=L_{\textrm{CLIP}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right)-c_1L_{\textrm{VF}}^{\left(u-1\right)}\!\left(\theta_{\textrm{v}}\right)+c_2L_{\textrm{EN}}^{\left(u-1\right)}\!\left(\theta_{\textrm{p}}\right),\end{equation}

where the hyperparameters $c_1,c_2\in\left[0,1\right]$ are the loss coefficients of the RL agent. Its estimated gradient, based on the K realisations, is then computed via automatic differentiation; see, for example, Baydin et al. (Reference Baydin, Pearlmutter, Radul and Siskind2018).
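For illustration, a condensed PyTorch sketch of the surrogate (14) on one batch of K realisations is given below; the pre-computed advantages and bootstrapped targets, as well as the tensor names, are assumptions about the surrounding bookkeeping rather than the paper's implementation.

```python
import torch

def ppo_surrogate(net, x, h, log_phi_old, adv, v_target, eps, c1, c2):
    """Surrogate (14) on one batch of K realisations.

    x           : (K, p) observed states,
    h           : (K,) realised hedging actions,
    log_phi_old : (K,) Gaussian log-densities of h under theta_p^(u-1),
    adv         : (K,) estimated advantages A-hat (bootstrapped as above),
    v_target    : (K,) bootstrapped reward sums, i.e. A-hat + V-hat(.; theta_v^(u-1)).
    These pre-computed inputs are assumptions of this sketch.
    """
    c, d2, v = net(x)                                    # current mean, variance, value estimate
    dist = torch.distributions.Normal(c, torch.sqrt(d2))
    log_phi = dist.log_prob(h)

    q = torch.exp(log_phi - log_phi_old)                 # importance sampling ratios
    l_clip = torch.minimum(q * adv,
                           torch.clamp(q, 1 - eps, 1 + eps) * adv).mean()
    l_vf = ((v_target - v) ** 2).mean()                  # value function loss
    l_en = torch.log(torch.sqrt(d2)).mean()              # entropy bonus, ln d(x; theta_p)

    return l_clip - c1 * l_vf + c2 * l_en                # J^{(u-1)}(theta), to be ascended
```

Its negation would then be passed to an optimiser step as in the sketch following (13), the gradient being obtained by automatic differentiation as noted above.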

4. Illustrative Example Revisited: Training Phase

Recall that, in the training phase, the insurer constructs a model of the market environment for an MDP training environment, while the RL agent, which does not know any specifics of this MDP environment, observes states and receives the anchor-hedging reward signals in (9) from it and hence gradually learns the hedging strategy by the PPO algorithm reviewed in the last section. This section revisits the illustrative example in section 2.4 via the two-phase RL approach in the training phase.

4.1. Markov decision process training environment

The model of the market environment is the BS model in the financial part and the CFM in the actuarial part. However, unlike in the market environment, where a single contract is written to a single policyholder, for effective training the insurer writes identical contracts to N homogeneous policyholders in the training environment. Because of the homogeneity of the contracts and the policyholders, for all $i=1,2,\dots,N$, $x_i=x$, $\rho^{\left(i\right)}=\rho$, $m^{\left(i\right)}=m$, $G_M^{\left(i\right)}=G_M$, $G_D^{\left(i\right)}=G_D$, $m_e^{\left(i\right)}=m_e$, and $F_t^{\left(i\right)}=F_t=\rho S_te^{-mt}$, for $t\in\left[0,T\right]$.

At any time $t\in\left[0,T\right]$ , the future gross liability of the insurer accumulated to the maturity is thus $\left(G_M-F_T\right)_+\sum_{i = 1}^{N}J_T^{\left(i\right)} +\sum_{i = 1}^{N} e^{r\left(T - T_{x}^{\left(i\right)} \right)}\!\left(G_D-F_{T_{x}^{\left(i\right)}}\right)_+\unicode{x1d7d9}_{\{T_{x}^{\left(i\right)} \lt T\}}J_{t}^{\left(i\right)},$ and its time-t discounted value is

\begin{align*}V_t^{\textrm{GL}}&=e^{-r\left(T-t\right)}\mathbb{E}^{\mathbb{Q}}\left[\left(G_M-F_T\right)_+\sum_{i=1}^{N}J_T^{\left(i\right)}\Big\vert\mathcal{F}_t\right] \\[5pt] & \quad +\mathbb{E}^{\mathbb{Q}}\left[\sum_{i=1}^{N}e^{-r\left(T_{x}^{\left(i\right)}-t\right)}\!\left(G_D-F_{T_{x}^{\left(i\right)}}\right)_+\unicode{x1d7d9}_{\{T_{x}^{\left(i\right)} \lt T\}}J_{t}^{\left(i\right)}\Big\vert\mathcal{F}_t\right]\\[5pt] \end{align*}
\begin{align*} &=e^{-r\left(T-t\right)}\mathbb{E}^{\mathbb{Q}}\left[\left(G_M-F_T\right)_+\vert\mathcal{F}_t\right]\sum_{i=1}^{N}\mathbb{E}^{\mathbb{Q}}\left[J_T^{\left(i\right)}\big\vert\mathcal{F}_t\right] \\[5pt] &\quad +\sum_{i=1}^{N}J_t^{\left(i\right)}\mathbb{E}^{\mathbb{Q}}\left[e^{-r\left(T_{x}^{\left(i\right)}-t\right)}\!\left(G_D-F_{T_{x}^{\left(i\right)}}\right)_+\unicode{x1d7d9}_{\{T_{x}^{\left(i\right)} \lt T\}}\Big\vert\mathcal{F}_t\right],\end{align*}

where the probability measure $\mathbb{Q}$ defined on $\left(\Omega,\mathcal{F}\right)$ is an equivalent martingale measure with respect to $\mathbb{P}$. Herein, the probability measure $\mathbb{Q}$ is chosen to be the product measure of the individual equivalent martingale measures in the actuarial and financial parts, which implies the independence among the Brownian motion W and the future lifetimes $T_x^{\left(1\right)},T_x^{\left(2\right)},\dots,T_x^{\left(N\right)}$, clarifying the first term in the second equality above. The second term in that equality is due to the fact that, for $i=1,2,\dots,N$, the single-jump process $J^{\left(i\right)}$ is $\mathbb{F}$-adapted. Under the probability measure $\mathbb{Q}$, all future lifetimes are identically distributed and have a CFM $\nu>0$, the same as under the probability measure $\mathbb{P}$ in section 2.4. Therefore, for any $i=1,2,\dots,N$, and for any $0\leq t\leq s\leq T$, the conditional survival probability $\mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}>t\right)=e^{-\nu\left(s-t\right)}$. For each policyholder $i=1,2,\dots,N$, by the independence and the Markov property, for any $0\leq t\leq s\leq T$,

(15) \begin{equation}\mathbb{E}^{\mathbb{Q}}\left[J_s^{\left(i\right)}\big\vert\mathcal{F}_t\right]=\mathbb{E}^{\mathbb{Q}}\left[J_s^{\left(i\right)}\big\vert J_t^{\left(i\right)}\right]=\begin{cases}\mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}\leq t\right)=0 & \textrm{if}\quad T_x^{\left(i\right)}\!\left(\omega\right)\leq t\\[5pt] \mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}>t\right)=e^{-\nu\left(s-t\right)} & \textrm{if}\quad T_x^{\left(i\right)}\!\left(\omega\right)>t\\[5pt] \end{cases}.\end{equation}

Moreover, under the probability measure $\mathbb{Q}$ , for any $t\in\left[0,T\right]$ , $dF_t=\left(r-m\right)F_tdt+\sigma F_tdW_t^{\mathbb{Q}}$ , where $W^{\mathbb{Q}}=\left\{W^{\mathbb{Q}}_t\right\}_{t\in\left[0,T\right]}$ is the standard Brownian motion under the probability measure $\mathbb{Q}$ . Hence, the time-t value of the discounted future gross liability, for $t\in\left[0,T\right]$ , is given by

\begin{align*}V_t^{\textrm{GL}}=&\;e^{-\nu\left(T-t\right)}\!\left(G_Me^{-r\left(T-t\right)}\Phi\left(-d_2\left(t,G_M\right)\right)-F_te^{-m\left(T-t\right)}\Phi\left(-d_1\left(t,G_M\right)\right)\right)\sum_{i=1}^{N}J_t^{\left(i\right)}\\[5pt] &\;+\int_{t}^{T}\!\left(G_De^{-r\left(T-s\right)}\Phi\left(-d_2\left(s,G_D\right)\right)-F_te^{-m\left(T-s\right)}\Phi\left(-d_1\left(s,G_D\right)\right)\right)\nu e^{-\nu\left(s-t\right)}ds\sum_{i=1}^{N}J_t^{\left(i\right)}, \end{align*}

where, for $s \in \left[0,T\right)$ and $G \gt 0$, $d_1\left(s,G\right)=\frac{\ln\left(\frac{F_s}{G}\right)+\left(r-m+\frac{\sigma^2}{2}\right)\!\left(T-s\right)}{\sigma\sqrt{T-s}}$, $d_2\left(s,G\right)=d_1\left(s,G\right)-\sigma\sqrt{T-s}$, $d_1\left(T,G\right) = \lim_{s\rightarrow T^{-}}d_1\left(s,G\right)$, $d_2\left(T,G\right) = d_1\left(T,G\right)$, and $\Phi\left(\cdot\right)$ is the standard Gaussian distribution function. Note that $\sum_{i=1}^{N}J_t^{\left(i\right)}$ represents the number of surviving policyholders at time $t\in\left[0,T\right]$.
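As a small numerical companion to the GMMB component of $V_t^{\textrm{GL}}$ (the first summand above), the following SciPy sketch evaluates, for $t<T$, the per-surviving-policyholder term $e^{-\nu\left(T-t\right)}\!\left(G_Me^{-r\left(T-t\right)}\Phi\left(-d_2\left(t,G_M\right)\right)-F_te^{-m\left(T-t\right)}\Phi\left(-d_1\left(t,G_M\right)\right)\right)$; the argument names are assumptions of this sketch, and the GMDB integral would be handled analogously by numerical quadrature.

```python
import numpy as np
from scipy.stats import norm

def gmmb_component(F_t, t, T, G_M, r, m, sigma, nu):
    """GMMB term of V_t^GL per surviving policyholder, for t < T."""
    tau = T - t
    d1 = (np.log(F_t / G_M) + (r - m + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    put = G_M * np.exp(-r * tau) * norm.cdf(-d2) - F_t * np.exp(-m * tau) * norm.cdf(-d1)
    return np.exp(-nu * tau) * put        # put value weighted by the survival probability
```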

As for the cumulative future rider charge to be collected by the insurer from any time $t\in\left[0,T\right]$ onward, it is given by $\sum_{i=1}^{N}\int_{t}^{T}m_eF_sJ_s^{\left(i\right)}e^{r(T-s)}ds$ , and its time-t discounted value is

\begin{equation*}V_t^{\textrm{RC}}=e^{-r\left(T-t\right)}\mathbb{E}^{\mathbb{Q}}\!\left[\!\sum_{i=1}^{N}\int_{t}^{T}m_eF_sJ_s^{\left(i\right)}e^{r(T-s)}ds\Big\vert\mathcal{F}_t\!\right]=\sum_{i=1}^{N}\int_{t}^{T}m_ee^{-r\left(s-t\right)}\mathbb{E}^{\mathbb{Q}}\left[F_s\vert F_t\right]\mathbb{E}^{\mathbb{Q}}\left[J_s^{\left(i\right)}\big\vert J_t^{\left(i\right)}\right]ds,\end{equation*}

where the second equality is again due to the independence and the Markov property. Under the probability measure $\mathbb{Q}$ , $\mathbb{E}^{\mathbb{Q}}\left[F_s\vert F_t\right]=e^{\left(r-m\right)\left(s-t\right)}F_t$ . Together with (15),

\begin{equation*}V_t^{\textrm{RC}}=\frac{1-e^{-\left(m+\nu\right)\left(T-t\right)}}{m+\nu}m_eF_t\sum_{i=1}^{N}J_t^{\left(i\right)}.\end{equation*}

Therefore, the time-t net liability of the insurer, for $t\in\left[0,T\right]$ , is given by

(16) \begin{align}L_t=V_t^{\textrm{GL}}-V_t^{\textrm{RC}}&=\Bigg(e^{-\nu\left(T-t\right)}\!\left(G_Me^{-r\left(T-t\right)}\Phi\left(-d_2\left(t,G_M\right)\right)-F_te^{-m\left(T-t\right)}\Phi\left(-d_1\left(t,G_M\right)\right)\right)\nonumber \\[5pt] &\quad+\int_{t}^{T}\!\left(G_De^{-r\left(T-s\right)}\Phi\left(-d_2\left(s,G_D\right)\right)-F_te^{-m\left(T-s\right)}\Phi\left(-d_1\left(s,G_D\right)\right)\right) \nonumber \\[5pt] &\quad \; \nu e^{-\nu\left(s-t\right)}ds-\frac{1-e^{-\left(m+\nu\right)\left(T-t\right)}}{m+\nu}m_eF_t\Bigg)\sum_{i=1}^{N}J_t^{\left(i\right)}, \end{align}

which contributes parts of the reward signals in (9). The time-t value of the insurer’s hedging portfolio, for $t\in\left[0,T\right]$ , as in (1), is given by: $P_0=0$ , and if $t\in\!\left(t_k,t_{k+1}\right]$ , for some $k=0,1,\dots,n-1$ ,

(17) \begin{align}P_t&=\left(P_{t_k}-H_{t_k}S_{t_k}\right)e^{r\left(t-t_k\right)}+H_{t_k}S_{t}+m_e\int_{t_k}^{t}F_se^{r\left(t-s\right)}\sum_{i=1}^{N}J_s^{\left(i\right)}ds\nonumber \\[5pt] &\quad -\sum_{i = 1}^{N}e^{r\left(t-T_{x}^{\left(i\right)}\right)}\!\left(G_D-F_{T_{x}^{\left(i\right)}}\right)_+\unicode{x1d7d9}_{\{t_k<T_{x}^{\left(i\right)}\leq t<T\}},\end{align}

which is also supplied to the reward signals in (9).
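For illustration, one step of the roll-forward (17) over a hedging interval $\left(t_k,t_{k+1}\right]$ can be sketched as follows; the left-endpoint approximation of the rider charge integral, the mid-interval accumulation of death benefits, and the argument names are all assumptions of this sketch rather than the paper's exact discretisation.

```python
import numpy as np

def roll_forward(P_k, H_k, S_k, S_next, F_k, n_alive_k, F_at_deaths, G_D, r, m_e, dt):
    """One step of (17) from t_k to t_{k+1} = t_k + dt, before maturity.

    n_alive_k   : number of surviving policyholders at t_k,
    F_at_deaths : segregated account values at the death times in (t_k, t_{k+1}].
    """
    growth = np.exp(r * dt)
    P_next = (P_k - H_k * S_k) * growth + H_k * S_next         # risk-free account and risky asset
    P_next += m_e * F_k * n_alive_k * dt * growth              # accrued rider charges (left-endpoint rule)
    death_benefits = np.sum(np.maximum(G_D - np.asarray(F_at_deaths), 0.0))
    P_next -= death_benefits * np.exp(r * dt / 2.0)            # GMDB paid, accumulated from (approx.) mid-interval
    return P_next
```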

At each time $t_k$, where $k=0,1,\dots,n$, the RL agent observes four features from this MDP environment; these four features are summarised in the state vector

(18) \begin{equation}X_{t_k}=\left(\ln{F_{t_k}},\frac{P_{t_k}}{N},\frac{\sum_{i=1}^{N}J^{\left(i\right)}_{t_k}}{N},T-t_k\right).\end{equation}

The first feature is the natural logarithm of the segregated account value of the policyholder. The second feature is the hedging portfolio value of the insurer, normalised by the initial number of policyholders. The third feature is the ratio of the number of surviving policyholders to the initial number of policyholders. These features are either log-transformed or normalised to prevent the RL agent from having to explore and learn from features with high variability. The last feature is the term to maturity. In particular, when either the third or the last feature first hits zero, i.e. at time $t_{\tilde{n}}$, an episode is terminated. The state space is $\mathcal{X}=\mathbb{R}\times\mathbb{R}\times\left\{0,1/N, 2/N,\dots, 1\right\}\times\left\{0,t_1,t_2,\dots,T\right\}$.

Recall that, at each time $t_k$, where $k=0,1,\dots,\tilde{n}-1$, with the state vector (18) as the input, the output of the policy network in (11) is the mean $c\!\left(X_{t_k};\;\theta_{\textrm{p}}\right)$ and the variance $d^2\left(X_{t_k};\;\theta_{\textrm{p}}\right)$ of a Gaussian measure; herein, the Gaussian measure represents the distribution of the average number of shares of the risky asset held by the insurer at the time $t_k$ for each surviving policyholder. Hence, for $k=0,1,\dots,\tilde{n}-1$, the hedging strategy $H_{t_k}$ in (17) is given by $H_{t_k}=\overline{H}_{t_k}\sum_{i=1}^{N}J^{\left(i\right)}_{t_k}$, where $\overline{H}_{t_k}$ is sampled from the Gaussian measure. Since the hedging strategy is assumed to be Markovian with respect to the state vector, it can be shown, albeit tediously, that the state vector in (18) and the hedging strategy together satisfy the Markov property in (3).
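Putting the last two paragraphs together, the construction of the state vector (18) and the scaling of a sampled per-policyholder action $\overline{H}_{t_k}$ to the aggregate position $H_{t_k}$ can be sketched as follows; `policy`, a callable returning the Gaussian parameters for a state, is a hypothetical stand-in for the trained policy network.

```python
import numpy as np

def make_state(F_t, P_t, n_alive, N, time_to_maturity):
    """State vector (18): log fund value, normalised portfolio value,
    survival ratio, and term to maturity."""
    return np.array([np.log(F_t), P_t / N, n_alive / N, time_to_maturity])

def aggregate_hedge(policy, state, n_alive, rng):
    """Sample the average number of shares per surviving policyholder from
    the Gaussian policy and scale it to the total position H_{t_k}."""
    c, d2 = policy(state)
    h_bar = rng.normal(c, np.sqrt(d2))
    return h_bar * n_alive
```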

Also recall that the infant RL agent is trained in the MDP environment with multiple homogeneous policyholders. The RL agent should then effectively update the ANN weights $\theta$ and learn the hedging strategies, via a more direct inference of the force of mortality from the third feature in the state vector. The RL agent hedges daily, so that the difference between consecutive discrete hedging times is $\delta t_{k}=t_{k+1}-t_{k}=\frac{1}{252}$, for $k=0,1,\dots,n-1$. In this MDP training environment, the parameters of the model are given in Table 3, but with $N=500$.

4.2. Building reinforcement learning agent

After constructing this MDP training environment, the insurer builds the RL agent, which implements the PPO reviewed in section 3.4. Table 6(a) summarises all hyperparameters of the implemented PPO, of which three are determined via grid search, while the remaining two are fixed a priori since they alter the surrogate performance measure itself and thus should not be chosen by grid search. Table 6(b) outlines the hyperparameters of the ANN architecture in section 3.3, all of which are pre-specified; therein, ReLU stands for Rectified Linear Unit, that is, the componentwise activation function is given by, for any $z\in\mathbb{R}$, $\psi\!\left(z\right)=\max\left\{z,0\right\}$.

Table 6. Hyperparameters setting of Proximal Policy Optimisation and neural network.

4.3. Training of reinforcement learning agent

With all of these set up, the insurer assigns the RL agent to experience this MDP training environment as much as possible, in order to observe the states, decide, as well as revise, the hedging strategy, and collect the anchor-hedging reward signals based on (9). Let $\mathcal{U}\in\mathbb{N}$ be the number of update steps on the ANN weights in the training environment. Hence, the policy of the experienced RL agent is given by $\pi\!\left(\cdot;\;\theta^{\left(\mathcal{U}\right)}\right)=\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(\mathcal{U}\right)}\right)$.

Figure 4 depicts the training log of the RL agent in terms of the bootstrapped sum of rewards and the batch entropy. In particular, Figure 4(a) shows that the value function in (2) reduces to almost zero after around $10^8$ training timesteps, which is equivalent to around 48,828 update steps for the ANN weights; within the same number of training timesteps, Figure 4(b) illustrates a gradual depletion of the batch entropy, and hence the Gaussian measure gradually becomes more concentrated around its mean, which implies that the RL agent progressively diminishes the degree of exploration of the MDP training environment, while increasing the degree of exploitation of the learned ANN weights.

Figure 4 Training log in terms of bootstrapped sum of rewards and batch entropy.

4.4. Baseline hedging performance

In the final step of the training phase, the trained RL agent is assigned to hedge in simulated scenarios from the same MDP training environment, except that $N=1$, which is in line with hedging in the market environment. The trained RL agent takes the deterministic action $c\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(\mathcal{U}\right)}\right)$, which is the mean of the Gaussian measure.

The number of simulated scenarios is 5,000. For each scenario, the insurer documents the realised terminal P&L, i.e. $P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}$. After all scenarios are experienced by the trained RL agent, the insurer examines the baseline hedging performance via the empirical distribution and the summary statistics of the realised terminal P&Ls. The baseline hedging performance of the RL agent is also benchmarked against those of other methods, namely the classical Deltas and the DH; see Appendix C for the implemented hyperparameters of the DH training. The following four classical Deltas are implemented in the simulated scenarios from the training environment, in which the (in)correctness of the Deltas is with respect to the training environment:

  • (correct) Delta of the CFM actuarial and BS financial models with the model parameters as in Table 3;

  • (incorrect) Delta of the increasing force of mortality (IFM) actuarial and BS financial models, where, for any $i=1,2,\dots,N$ , if $T<\overline{b}$ , the conditional survival probability $\mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}>t\right)=\frac{\overline{b}-s}{\overline{b}-t}$ , for any $0\leq t\leq s\leq T<\overline{b}$ , while if $\overline{b}\leq T$ , the conditional survival probability $\mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}>t\right)=\frac{\overline{b}-s}{\overline{b}-t}$ , for any $0\leq t\leq s<\overline{b}\leq T$ , and $\mathbb{Q}\!\left(T_x^{\left(i\right)}>s\vert T_x^{\left(i\right)}>t\right)=0$ , for any $0\leq t\leq\overline{b}\leq s\leq T$ or $0\leq\overline{b}\leq t\leq s\leq T$ , with the model parameters as in Tables 3(a) and 7;

  • (incorrect) Delta in the CFM actuarial and Heston financial models, where, for any $t\in\left[0,T\right]$ , $dS_t=\mu S_tdt+\sqrt{\Sigma_t}S_tdW_t^{\left(1\right)}$ , $d\Sigma_t=\kappa\left(\overline{\Sigma}-\Sigma_t\right)dt+\eta\sqrt{\Sigma_t}dW_t^{\left(2\right)}$ , and $\left\langle W^{(1)},W^{(2)}\right\rangle_t=\phi t$ , with the model parameters as in Tables 3(b) and 8;

  • (incorrect) Delta in the IFM actuarial and Heston financial models with the model parameters as in Tables 7 and 8.

Table 7. Parameters setting of increasing force of mortality actuarial model for Delta.

Table 8. Parameters setting of Heston financial model for Delta.

Figure 5 shows the empirical density and cumulative distribution functions via the 5,000 realised terminal P&Ls by each hedging approach, while Table 9 outlines the summary statistics of these empirical distributions. To clearly illustrate the comparisons, Figure 6 depicts the empirical density functions via the 5,000 pathwise differences of the realised terminal P&Ls between the RL agent and each of the other approaches, while Table 10 lists the summary statistics of the empirical distributions; for example, comparing with the DH approach, the pathwise difference of the realised terminal P&Ls for the e-th simulated scenario, for $e=1,2,\dots,5,000$ , is calculated by $\left(P_{t_{\tilde{n}}}^{\textrm{RL}}\!\left(\omega_e\right)-L_{t_{\tilde{n}}}^{\textrm{RL}}\!\left(\omega_e\right)\right)-\left(P_{t_{\tilde{n}}}^{\textrm{DH}}\!\left(\omega_e\right)-L_{t_{\tilde{n}}}^{\textrm{DH}}\!\left(\omega_e\right)\right)$ .

Figure 5 Empirical density and cumulative distribution functions of realised terminal P&Ls by the approaches of reinforcement learning, classical Deltas, and deep hedging.

As expected, the baseline hedging performance of the trained RL agent in this training environment is comparable with those of the correct CFM and BS Delta, as well as of the DH approach. Moreover, the RL agent outperforms the three incorrect Deltas, which are based on the incorrect IFM actuarial model, the incorrect Heston financial model, or both.

5. Online Learning Phase

Given the satisfactory baseline hedging performance of the experienced RL agent in the MDP training environment, the insurer finally assigns the agent to interact and learn from the market environment.

Table 9. Summary statistics of empirical distributions of realised terminal P&Ls by the approaches of reinforcement learning, classical Deltas, and deep hedging.

Figure 6 Empirical density functions of realised pathwise differences of terminal P&Ls comparing with the approaches of classical Deltas and deep hedging.

Table 10. Summary statistics of empirical distributions of realised pathwise differences of terminal P&Ls comparing with the approaches of classical Deltas and deep hedging.

To distinguish them from the simulated times in the training environment, let $\tilde{t}_k$, for $k=0,1,2,\dots$, be the real times when the RL agent decides the hedging strategy in the market environment, such that $0=\tilde{t}_{0}<\tilde{t}_{1}<\tilde{t}_{2}<\cdots$, and $\delta\tilde{t}_k=\tilde{t}_{k+1}-\tilde{t}_k=\frac{1}{252}$. Note that the current time $t=\tilde{t}_0=0$ and the RL agent shall hedge daily on behalf of the insurer. At the current time 0, the insurer writes a variable annuity contract with the GMMB and GMDB riders to the first policyholder. When this first contract terminates, due to either the death of the first policyholder or the expiration of the contract, the insurer shall write an identical contract, i.e. a contract with the same characteristics, to the second policyholder, and so on. These contract re-establishments ensure that the insurer holds only one written variable annuity contract with the GMMB and GMDB riders at a time, and the RL agent solely hedges the contract that is effective at that moment.

To this end, iteratively, for the $\iota$ -th policyholder, where $\iota\in\mathbb{N}$ , let $\tilde{t}_{\tilde{n}^{\left(\iota\right)}}$ be the first time (right) after the $\iota$ -th policyholder dies or the contract expires, for some $\tilde{n}^{\left(\iota\right)}=\tilde{n}^{\left(\iota-1\right)}+1,\tilde{n}^{\left(\iota-1\right)}+2,\dots,\tilde{n}^{\left(\iota-1\right)}+n$ ; that is $\tilde{t}_{\tilde{n}^{\left(\iota\right)}}=\min\left\{\tilde{t}_{k},k=\tilde{n}^{\left(\iota-1\right)}+1,\tilde{n}^{\left(\iota-1\right)}+2,\dots,\!\!\right.$ $\left.\tilde{n}^{\left(\iota-1\right)}+n\;:\;\tilde{t}_{k}-\tilde{t}_{\tilde{n}^{\left(\iota-1\right)}}\geq T^{\left(\iota\right)}_{x_{\iota}}\wedge T\right\}$ , where, by convention, $\tilde{n}^{\left(0\right)}=0$ . Therefore, the contract effective time for the $\iota$ -th policyholder $\tau_{k}^{\left(\iota\right)}=\tilde{t}_{\tilde{n}^{\left(\iota-1\right)}+k}$ , where $\iota\in\mathbb{N}$ and $k=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}$ ; in particular, $\tau_{0}^{\left(\iota\right)}=\tilde{t}_{\tilde{n}^{\left(\iota-1\right)}}$ is the contract inception time for the $\iota$ -th policyholder. Figure 7 depicts one of the possible realisations for clearly illustrating the real time and the contract effective time.

Figure 7 An illustrative timeline with the real time and the contract effective time in the online learning phase.

In the online learning phase, the trained RL agent carries on with the PPO policy gradient method in the market environment. That is, as in section 3.4, starting from the ANN weights $\theta^{\left(\mathcal{U}\right)}$ at the current time 0, and by interacting with the market environment to observe the states and collect the reward signals, the RL agent further updates the ANN weights, at each update step, using a batch of $\tilde{K}\in\mathbb{N}$ realisations and the (stochastic) gradient ascent in (13) with the surrogate performance measure in (14).

However, there are subtle differences between applying the PPO in the market environment and in the training environment. At each further update step $v=1,2,\dots$, based on the ANN weights $\theta^{\left(\mathcal{U}+v-1\right)}$, and thus the policy $\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(\mathcal{U}+v-1\right)}\right)$, the RL agent hedges each effective contract of $\tilde{E}^{\left(v\right)}\in\mathbb{N}$ realised policyholders for the $\tilde{K}\in\mathbb{N}$ realisations. Indeed, the concept of episodes in the training environment, with the state re-initiation when one episode ends, is replaced by sequential policyholders in the real-time market environment, via the contract re-establishment when one policyholder dies or a contract expires.

  • If $\tilde{E}^{\left(v\right)}=1$ , which is when $\left(v-1\right)\tilde{K},v\tilde{K}\in\left[\tilde{n}^{\left(\iota-1\right)},\tilde{n}^{\left(\iota\right)}\right]$ , for some $\iota\in\mathbb{N}$ , the batch of $\tilde{K}$ realisations is collected solely from the $\iota$ -th policyholder. The realisations are given by

    \begin{align*}&\;\left\{\dots,x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}}}^{\left(v-1,\iota\right)},x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+\tilde{K}-1}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+\tilde{K}-1}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+\tilde{K}-1}}^{\left(v-1,\iota\right)},x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+\tilde{K}}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+\tilde{K}}}^{\left(v-1,\iota\right)},\dots\right\},\end{align*}
    where $\tilde{K}_s^{\left(v\right)}=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}-1$ , such that the time $\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}}$ is when the first state is observed for the $\iota$ -th policyholder in this update; necessarily, $\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}-\tilde{K}_s^{\left(v\right)}\geq\tilde{K}$ .
  • If $\tilde{E}^{\left(v\right)}=2,3,\dots$ , which is when $\left(v-1\right)\tilde{K}\in\left[\tilde{n}^{\left(\iota-1\right)},\tilde{n}^{\left(\iota\right)}\right]$ and $v\tilde{K}\in\left[\tilde{n}^{\left(j-1\right)},\tilde{n}^{\left(j\right)}\right]$ , for some $\iota,j\in\mathbb{N}$ such that $\iota<j$ , the batch of $\tilde{K}$ realisations is collected from the $\iota$ -th, $\left(\iota+1\right)$ -th, $\dots$ , and j-th policyholders; that is, $\tilde{E}^{\left(v\right)}=j-\iota+1$ . The realisations are given by

    \begin{align*}&\;\left\{\dots,x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}}}^{\left(v-1,\iota\right)},x_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{K}_s^{\left(v\right)}+1}}^{\left(v-1,\iota\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{\tau^{\left(\iota\right)}_{\tilde{n}^{\left(\iota\right)} - \tilde{n}^{\left(i - 1\right)} -1}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{n}^{\left(\iota\right)} - \tilde{n}^{\left(i - 1\right)} -1}}^{\left(v-1,\iota\right)},h_{\tau^{\left(\iota\right)}_{\tilde{n}^{\left(\iota\right)} - \tilde{n}^{\left(i - 1\right)} -1}}^{\left(v-1,\iota\right)},x_{\tau^{\left(\iota\right)}_{\tilde{n}^{\left(\iota\right)} - \tilde{n}^{\left(i - 1\right)}}}^{\left(v-1,\iota\right)},r_{\tau^{\left(\iota\right)}_{\tilde{n}^{\left(\iota\right)} - \tilde{n}^{\left(i - 1\right)}}}^{\left(v-1,\iota\right)}\right\},\\[5pt] &\;\left\{x_{\tau^{\left(\iota+1\right)}_{0}}^{\left(v-1,\iota+1\right)},h_{\tau^{\left(\iota+1\right)}_{0}}^{\left(v-1,\iota+1\right)},x_{\tau^{\left(\iota+1\right)}_{1}}^{\left(v-1,\iota+1\right)},r_{\tau^{\left(\iota+1\right)}_{1}}^{\left(v-1,\iota+1\right)},h_{\tau^{\left(\iota+1\right)}_{1}}^{\left(v-1,\iota+1\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{\tau^{\left(\iota+1\right)}_{\tilde{n}^{\left(\iota+1\right)} - \tilde{n}^{\left(\iota\right)} -1}}^{\left(v-1,\iota+1\right)},r_{\tau^{\left(\iota+1\right)}_{\tilde{n}^{\left(\iota+1\right)} - \tilde{n}^{\left(\iota\right)} -1}}^{\left(v-1,\iota+1\right)},h_{\tau^{\left(\iota+1\right)}_{\tilde{n}^{\left(\iota+1\right)} - \tilde{n}^{\left(\iota\right)} -1}}^{\left(v-1,\iota+1\right)},x_{\tau^{\left(\iota+1\right)}_{\tilde{n}^{\left(\iota+1\right)} - \tilde{n}^{\left(\iota\right)}}}^{\left(v-1,\iota+1\right)},r_{\tau^{\left(\iota+1\right)}_{\tilde{n}^{\left(\iota+1\right)} - \tilde{n}^{\left(\iota\right)}}}^{\left(v-1,\iota+1\right)}\right\},\\[5pt] &\;\dots,\\[5pt] &\;\left\{x_{\tau^{\left(j-1\right)}_{0}}^{\left(v-1,j-1\right)},h_{\tau^{\left(j-1\right)}_{0}}^{\left(v-1,j-1\right)},x_{\tau^{\left(j-1\right)}_{1}}^{\left(v-1,j-1\right)},r_{\tau^{\left(j-1\right)}_{1}}^{\left(v-1,j-1\right)},h_{\tau^{\left(j-1\right)}_{1}}^{\left(v-1,j-1\right)},\right.\\[5pt] &\;\left.\quad\dots,x_{\tau^{\left(j-1\right)}_{\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(j-2\right)} -1}}^{\left(v-1,j-1\right)},r_{\tau^{\left(j-1\right)}_{\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(j-2\right)} -1}}^{\left(v-1,j-1\right)},h_{\tau^{\left(j-1\right)}_{\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(j-2\right)} -1}}^{\left(v-1,j-1\right)},x_{\tau^{\left(j-1\right)}_{\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(j-2\right)}}}^{\left(v-1,j-1\right)},r_{\tau^{\left(j-1\right)}_{\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(j-2\right)}}}^{\left(v-1,j-1\right)}\right\},\\[5pt] &\;\left\{x_{\tau^{\left(j\right)}_{0}}^{\left(v-1,j\right)},h_{\tau^{\left(j\right)}_{0}}^{\left(v-1,j\right)},x_{\tau^{\left(j\right)}_{1}}^{\left(v-1,j\right)},r_{\tau^{\left(j\right)}_{1}}^{\left(v-1,j\right)},h_{\tau^{\left(j\right)}_{1}}^{\left(v-1,j\right)},\right.\\[5pt] 
&\;\left.\quad\dots,x_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}-1}}^{\left(v-1,j\right)},r_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}-1}}^{\left(v-1,j\right)},h_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}-1}}^{\left(v-1,j\right)},x_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)},r_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)},\dots\right\},\end{align*}
    where $\tilde{K}_f^{\left(v\right)}=1,2,\dots,\tilde{n}^{\left(j\right)}-\tilde{n}^{\left(j-1\right)}$, such that the time $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}$ is when the last state is observed for the j-th policyholder in this update; necessarily, $\tilde{n}^{\left(j-1\right)} - \tilde{n}^{\left(\iota - 1\right)} + \tilde{K}_f^{\left(v\right)} - \tilde{K}_s^{\left(v\right)} = \tilde{K}$.

Moreover, the first two features in the state vector (18) are based on the real-time risky asset price realisation from the market, while all features depend on a particular effective policyholder. For $\iota\in\mathbb{N}$ and $k=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}$ ,

(19) \begin{equation}X_{\tau^{\left(\iota\right)}_{k}}^{\left(v-1,\iota\right)}=\begin{cases}\!\left(\ln F^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},1,T-\left(\tau^{\left(\iota\right)}_{k}-\tau^{\left(\iota\right)}_{0}\right)\right)&\textrm{if}\; k=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}-1\\[5pt] \left(\ln F^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},0,T-\left(\tau^{\left(\iota\right)}_{k}-\tau^{\left(\iota\right)}_{0}\right)\right)&\textrm{if}\; k=\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}\textrm{ and }T^{\left(\iota\right)}_{x_{\iota}}\leq T\\[5pt] \left(\ln F^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}},1,0\right)&\textrm{if}\; k=\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}\textrm{ and }T^{\left(\iota\right)}_{x_{\iota}}>T\end{cases},\end{equation}

where $F^{\left(\iota\right)}_t=\rho^{\left(\iota\right)} S_{t}e^{-m^{\left(\iota\right)}\!\left(t-\tau^{\left(\iota\right)}_{0}\right)}$ , if $t\in\left[\tau^{\left(\iota\right)}_{0},\tilde{t}_{\tilde{n}^{\left(\iota\right)}}\right]$ , $P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{0}}=0$ , and

\begin{align*}P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k}}=&\;\left(P^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k-1}}-H^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k-1}}S_{\tau^{\left(\iota\right)}_{k-1}}\right)e^{r\left(\tau^{\left(\iota\right)}_{k}-\tau^{\left(\iota\right)}_{k-1}\right)}+H^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{k-1}}S_{\tau^{\left(\iota\right)}_{k}}+m^{\left(\iota\right)}_e\int_{\tau^{\left(\iota\right)}_{k-1}}^{\tau^{\left(\iota\right)}_{k}}F^{\left(\iota\right)}_{s}e^{r\left(\tau^{\left(\iota\right)}_{k}-s\right)}J_s^{\left(\iota\right)}ds\\[5pt] &\;-\left(G_D-F_{T_{x_\iota}^{\left(\iota\right)}}^{\left(\iota\right)}\right)_+\unicode{x1d7d9}_{\{\tau^{\left(\iota\right)}_{k-1}<T_{x_\iota}^{\left(\iota\right)}\leq \tau^{\left(\iota\right)}_{k}\}}e^{r\left(\tau^{\left(\iota\right)}_{k}-T_{x_\iota}^{\left(\iota\right)}\right)},\end{align*}

for $k=1,2,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}$. Recall also that the reward signals collected from the market environment should be based on that in (8); that is, for $\iota\in\mathbb{N}$ and $k=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}$,

\begin{equation*}R_{\tau^{\left(\iota\right)}_{k}}^{\left(v-1,\iota\right)}=\begin{cases}0&\textrm{if}\quad k=0,1,\dots,\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}-1\\[5pt] -\left(P^{\left(\iota\right)}_{\tilde{t}_{\tilde{n}^{\left(\iota\right)}}}-L^{\left(\iota\right)}_{\tilde{t}_{\tilde{n}^{\left(\iota\right)}}}\right)^2&\textrm{if}\quad k=\tilde{n}^{\left(\iota\right)}-\tilde{n}^{\left(\iota-1\right)}\end{cases},\end{equation*}

in which $L^{\left(\iota\right)}_{\tilde{t}_{\tilde{n}^{\left(\iota\right)}}}=0$ if $T^{\left(\iota\right)}_{x_{\iota}}\leq T$ , and $L^{\left(\iota\right)}_{\tilde{t}_{\tilde{n}^{\left(\iota\right)}}}=\left(G_M-F^{\left(\iota\right)}_{\tau^{\left(\iota\right)}_{0}+T}\right)_+$ if $T^{\left(\iota\right)}_{x_{\iota}}>T$ .
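A compact sketch of this single terminal reward signal per policyholder (the argument names are assumptions of this sketch):

```python
def online_reward(is_terminal, P_terminal, survived_to_maturity, G_M, F_at_maturity):
    """Reward signal in the online learning phase: zero before the contract of
    the current policyholder terminates, and the negative squared terminal
    hedging error -(P - L)^2 at termination, where L is the GMMB payoff if the
    policyholder survives to maturity and zero otherwise."""
    if not is_terminal:
        return 0.0
    L = max(G_M - F_at_maturity, 0.0) if survived_to_maturity else 0.0
    return -(P_terminal - L) ** 2
```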

Table 11 summarises all hyperparameters of the implemented PPO in the market environment, while the hyperparameters of the ANN architecture are still given in Table 6(b). In the online learning phase, the insurer should choose a smaller batch size $\tilde{K}$ compared with that in the training phase; this yields a higher updating frequency of the PPO and ensures that the experienced RL agent can revise the hedging strategy within a reasonable amount of time. However, fewer realisations in a batch lead to less credible updates; hence, the insurer should also tune down the learning rate $\tilde{\alpha}$, from that in the training phase, to reduce the reliance on each further update step.

Table 11. Hyperparameters setting of Proximal Policy Optimisation for online learning with bolded hyperparameters being different from those for training.

6. Illustrative Example Revisited: Online Learning Phase

This section revisits the illustrative example in section 2.4 via the two-phase RL approach in the online learning phase. In the market environment, the policyholders to whom the contracts with both GMMB and GMDB riders are sequentially written are homogeneous. Due to the contract re-establishments to these sequential homogeneous policyholders, the number and age of policyholders are reset to the values in Table 3(b) at each contract inception time. Furthermore, via the approach discussed in section 2.1.3, to determine the fee structures of each contract at its inception time, the insurer relies on the parameters of the model of the market environment in Table 3, except that the risky asset initial price therein is replaced by the risky asset price observed at the contract inception time. Note that the fee structures of the first contract are still given as in Table 4, since the risky asset price observed at $t=0$ is exactly the risky asset initial price.

Let $\mathcal{V}\in\mathbb{N}$ be the number of further update steps in the market environment on the ANN weights. In order to showcase the result that (RLw/OL), the further trained RL agent with the online learning phase, could gradually revise the hedging strategy, from the nearly optimal one in the training environment, to the one in the market environment, we evaluate the hedging performance of RLw/OL on a rolling basis. That is, right after each further update step $v=1,2,\dots,\mathcal{V}$ , we first simulate $\tilde{M}=500$ market scenarios stemming from the real-time realised state vector $x_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)}$ and by implementing the hedging strategy from the updated policy $\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(\mathcal{U}+v\right)}\right)$ , i.e. the further trained RL agent takes the deterministic action $c\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(\mathcal{U}+v\right)}\right)$ which is the mean of the Gaussian measure; we then document the realised terminal P&L, for each of the 500 simulated scenarios, i.e. $P^{\textrm{RLw/OL}}_{t}\!\left(\omega_e\right)-L_{t}\!\left(\omega_e\right)$ , for $e=1,2,\dots,500$ , where $t=\tilde{t}_{\tilde{n}^{\left(j\right)}}\!\left(\omega_e\right)$ if $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}<\tilde{t}_{\tilde{n}^{\left(j\right)}}$ , and $t=\tilde{t}_{\tilde{n}^{\left(j+1\right)}}\!\left(\omega_e\right)$ if $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}=\tilde{t}_{\tilde{n}^{\left(j\right)}}$ .

Since the state vector $x_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)}$ is realised in real time, the realised terminal P&L in fact depends on, not only the simulated scenarios after each update but also the actual realisation in the market environment. To this end, from the current time 0, we simulate $M=1,000$ future trajectories in the market environment; for each future trajectory $f=1,2,\dots,1,000$ , the aforementioned realised terminal P&Ls are obtained as $P^{\textrm{RLw/OL}}_{t}\!\left(\omega_f,\omega_e\right)-L_{t}\!\left(\omega_f,\omega_e\right)$ , for $e=1,2,\dots,500$ , where $t=\tilde{t}_{\tilde{n}^{\left(j\right)}}\!\left(\omega_f,\omega_e\right)$ if $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}\!\left(\omega_f\right)<\tilde{t}_{\tilde{n}^{\left(j\right)}}\!\left(\omega_f\right)$ , and $t=\tilde{t}_{\tilde{n}^{\left(j+1\right)}}\!\left(\omega_f,\omega_e\right)$ if $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}\!\left(\omega_f\right)=\tilde{t}_{\tilde{n}^{\left(j\right)}}\!\left(\omega_f\right)$ .

The rolling-basis hedging performance of RLw/OL is benchmarked against those of (RLw/oOL) the trained RL agent without the online learning phase, (CD) the correct Delta based on the market environment, and (ID) the incorrect Delta based on the training environment. For the same set of future trajectories $\omega_f$, for $f=1,2,\dots,1,000$, and the same sets of simulated scenarios $\omega_e$, for $e=1,2,\dots,500$, the realised terminal P&Ls are also obtained by implementing each of these benchmark strategies from the current time 0 onward, none of which needs to be updated throughout; denote the realised terminal P&L as $P^{\mathcal{S}}_{t}\!\left(\omega_f,\omega_e\right)-L_{t}\!\left(\omega_f,\omega_e\right)$, where $\mathcal{S}=\textrm{RLw/OL},\textrm{RLw/oOL},\textrm{CD},\textrm{ or }\textrm{ID}$.

This example considers $\mathcal{V}=25$ further update steps of RLw/OL, for each future trajectory $\omega_f$ , where $f=1,2,\dots,1,000$ ; as the batch size in the online learning phase $\tilde{K}=30$ , this is equivalent to 750 trading days, which is just less than 3 years (assuming that non-trading days are uniformly spread across a year). For each $f=1,2,\dots,1,000$ , and $v=1,2,\dots,25$ , let $\mu^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)$ be the expected terminal P&L, right after the v-th further update step implementing the hedging strategy $\mathcal{S}$ for the future trajectory $\omega_f$ :

\begin{equation*}\mu^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right) = \mathbb{E} \left[P^{\mathcal{S}}_{t}\!\left(\omega_f,\cdot\right)-L_{t}\!\left(\omega_f,\cdot\right)\Big\vert X_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)} = X_{\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}}^{\left(v-1,j\right)}\!\left(\omega_f\right)\right],\end{equation*}

which is a conditional expectation taking with respect to the scenarios from the time $\tau^{\left(j\right)}_{\tilde{K}_f^{\left(v\right)}}$ forward; let $\hat{\mu}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)$ be the sample mean of the terminal P&L based on the simulated scenarios:

(20) \begin{equation}\hat{\mu}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right) = \frac{1}{500}\sum_{e = 1}^{500}\!\left(P_{t}^{\mathcal{S}}\!\left(\omega_f,\omega_e\right)-L_{t}\!\left(\omega_f,\omega_e\right)\right).\end{equation}

Figure 8 plots the sample means of the terminal P&L in (20), right after each further update step and implementing each hedging strategy, in two future trajectories. Firstly, notice that, in both future trajectories, the average hedging performance of RLw/oOL is even worse than that of ID. Secondly, the average hedging performances of RLw/OL in the two future trajectories are substantially different. In the best-case future trajectory, the RLw/OL is able to swiftly self-revise the hedging strategy and hence quickly catch up with the average hedging performance of ID after only twelve further updates of the ANN weights, as well as with that of CD in around two years; however, in the worst-case future trajectory, within 3 years, the RLw/OL is not able to improve the average hedging performance even to the level of ID, let alone to that of CD.

In view of the second observation above, the hedging performance of RLw/OL should not be judged on each future trajectory alone; instead, it should be studied across the future trajectories. To this end, for each $f=1,2,\dots,1,000$, define

\begin{equation*}v_{\textrm{CD}}\!\left(\omega_f\right) = \min\left\{v = 1, 2, \dots, 25\;:\; \hat{\mu}^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)>\hat{\mu}^{\left(v,j\right)}_{\textrm{CD}}\!\left(\omega_f\right)\right\}\end{equation*}

as the first further update step such that the sample mean of the terminal P&L by RLw/OL is strictly greater than that by CD, for the future trajectory $\omega_f$ ; herein, let $\min\emptyset=26$ and also define $t_{\textrm{CD}}\!\left(\omega_f\right)=v_{\textrm{CD}}\!\left(\omega_f\right)\times\frac{\tilde{K}}{252}$ as the corresponding number of years. Therefore, the estimated proportion of the future trajectories, where RLw/OL is able to exceed the average hedging performance of CD within 3 years, is given by

\begin{equation*}\frac{1}{1,000}\sum_{f = 1}^{1,000}\unicode{x1d7d9}_{\left\{t_{\textrm{CD}}\!\left(\omega_f\right)\leq 3\right\}}= 95.4{\%}.\end{equation*}
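A minimal sketch of the first surpassing step $v_{\textrm{CD}}$, its conversion to years, and the estimated proportion above, assuming sample-mean arrays as in the earlier sketch (placeholders here):

```python
import numpy as np

F, V, K_tilde = 1000, 25, 30
# mu_hat[s] of shape (F, V): sample means from (20); zeros are placeholders only.
mu_hat = {s: np.zeros((F, V)) for s in ("RLw/OL", "CD", "ID")}

def first_surpass_step(mu_rl, mu_bench):
    """min{v = 1, ..., 25 : mu_rl[v] > mu_bench[v]}, with min of the empty set equal to 26."""
    exceed = mu_rl > mu_bench
    return int(np.argmax(exceed)) + 1 if exceed.any() else 26

v_cd = np.array([first_surpass_step(mu_hat["RLw/OL"][f], mu_hat["CD"][f]) for f in range(F)])
t_cd = v_cd * K_tilde / 252                   # first surpassing time in years
prop_within_3y = (t_cd <= 3).mean()           # estimated proportion (95.4% in the paper)
```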

For each $f=1,2,\dots,1,000$, define $v_{\textrm{ID}}\!\left(\omega_f\right)$ and $t_{\textrm{ID}}\!\left(\omega_f\right)$ similarly for comparing RLw/OL with ID. Figure 9 shows the empirical conditional density functions of $t_{\textrm{CD}}$ and $t_{\textrm{ID}}$, both conditional on RLw/OL exceeding the average hedging performance of CD within 3 years. Table 12 lists the summary statistics of the empirical conditional distributions.

The above analysis neglects the variance, due to the simulated scenarios, of the hedging performance of each hedging strategy. In the following, for each future trajectory, we define a refined first further update step at which the expected terminal P&L by RLw/OL is statistically significantly greater than that by CD. To this end, for each $f=1,2,\dots,1,000$, and $v=1,2,\dots,25$, consider the following null and alternative hypotheses:

\begin{equation*}H_{0,\mathcal{S}}^{\left(v,j\right)}\!\left(\omega_f\right)\;:\;\mu^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)\leq\mu^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)\quad\textrm{versus}\quad H_{1,\mathcal{S}}^{\left(v,j\right)}\!\left(\omega_f\right)\;:\;\mu^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)>\mu^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right),\end{equation*}

where $\mathcal{S}=\textrm{CD or ID}$; the analysis above supports this choice of the alternative hypothesis. Define the test statistic and the p-value, respectively, by

Figure 8 Best-case and worst-case samples of future trajectories for rolling-basis evaluation of reinforcement learning agent with online learning phase, and comparisons with classical Deltas and reinforcement learning agent without online learning phase.

Figure 9 Empirical conditional density functions of first surpassing times conditioning on reinforcement learning agent with online learning phase exceeding correct Delta in terms of sample means of terminal P&L within 3 years.

\begin{equation*}\mathcal{T}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right) = \frac{\hat{\mu}^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right) - \hat{\mu}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)}{\sqrt{\frac{\hat{\sigma}^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)^2}{500}+\frac{\hat{\sigma}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)^2}{500}}}\quad\textrm{and}\quad p^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right) = \mathbb{P}\!\left(T_{\mathcal{S}}\!\left(\omega_f\right)>\mathcal{T}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)\right),\end{equation*}

where the random variable $T_{\mathcal{S}}\!\left(\omega_f\right)$ follows a Student’s t-distribution with degrees of freedom

\begin{equation*}\textrm{df}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right) = \frac{\left(\frac{\hat{\sigma}^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)^2}{500}+\frac{\hat{\sigma}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)^2}{500}\right)^2}{\frac{\left(\hat{\sigma}^{\left(v,j\right)}_{\textrm{RLw/OL}}\!\left(\omega_f\right)^2/500\right)^2}{500-1}+\frac{\left(\hat{\sigma}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)^2/500\right)^2}{500-1}},\end{equation*}

and the sample variance $\hat{\sigma}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)^2$ of the terminal P&L based on the simulated scenarios is given by

\begin{equation*}\hat{\sigma}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)^2= \frac{1}{499}\sum_{e = 1}^{500}\!\left(\left(P_{t}^{\mathcal{S}}\!\left(\omega_f,\omega_e\right)-L_{t}\!\left(\omega_f,\omega_e\right)\right)-\hat{\mu}^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)\right)^2.\end{equation*}

For a fixed level of significance $\alpha^*\in\left(0,1\right)$, if $p^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)<\alpha^*$, then the expected terminal P&L by RLw/OL is statistically significantly greater than that by $\mathcal{S}=\textrm{CD or ID}$.
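A minimal sketch of this one-sided Welch test, following the test statistic, degrees of freedom, and p-value defined above; here `pnl_rl` and `pnl_bench` stand for the 500 realised terminal P&Ls of RLw/OL and of the benchmark strategy for a given future trajectory and update step, and the names are illustrative.

```python
import numpy as np
from scipy import stats

def welch_one_sided(pnl_rl, pnl_bench):
    """One-sided Welch test of H0: mu_RL <= mu_S against H1: mu_RL > mu_S."""
    n1, n2 = len(pnl_rl), len(pnl_bench)
    m1, m2 = pnl_rl.mean(), pnl_bench.mean()
    v1, v2 = pnl_rl.var(ddof=1), pnl_bench.var(ddof=1)   # sample variances
    t_stat = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p_value = stats.t.sf(t_stat, df)                     # P(T > t_stat)
    return t_stat, df, p_value

# Reject H0 at level alpha_star, i.e. conclude that RLw/OL is significantly
# better than the benchmark, whenever p_value < alpha_star.
```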

In turn, for each $f=1,2,\dots,1,000$ , and for any $\alpha^*\in\left(0,1\right)$ , define

\begin{equation*}v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*\right) = \min\left\{v = 1, 2, \dots, 25\;:\; p^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)<\alpha^*\right\}\end{equation*}

as the first further update step at which the expected terminal P&L by RLw/OL is statistically significantly greater than that by $\mathcal{S}=\textrm{CD or ID}$, for the future trajectory $\omega_f$, at the level of significance $\alpha^*$; again, herein, let $\min\emptyset=26$ and define $t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*\right)=v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*\right)\times\frac{\tilde{K}}{252}$ as the corresponding number of years. Table 13 lists the estimated proportion of the future trajectories in which RLw/OL exceeds, with statistical significance, the expected terminal P&L of $\mathcal{S}$ within 3 years, which is given by $\sum_{f = 1}^{1,000}\unicode{x1d7d9}_{\left\{t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*\right)\leq 3\right\}}/1,000$, for various levels of significance.

When the level of significance $\alpha^*$ gradually decreases from $0.20$ to $0.01$, both estimated proportions of the future trajectories, in which RLw/OL exceeds CD or ID with statistical significance within 3 years, decline. This is because, for any $\alpha^*_1,\alpha^*_2\in\left(0,1\right)$ with $\alpha^*_1\leq\alpha^*_2$, and for any $\omega_f$, for $f=1,2,\dots,1,000$, $t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 3$ implies that $t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_2\right)\leq 3$; thus, $\unicode{x1d7d9}_{\left\{t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 3\right\}}\leq\unicode{x1d7d9}_{\left\{t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_2\right)\leq 3\right\}}$, and hence $\sum_{f = 1}^{1,000}\unicode{x1d7d9}_{\left\{t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 3\right\}}/1,000\leq\sum_{f = 1}^{1,000}\unicode{x1d7d9}_{\left\{t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_2\right)\leq 3\right\}}/1,000$. Indeed, since $t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 3$, or equivalently $v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right) \leq 25$, we have $p^{\left(v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right),j\right)}_{\mathcal{S}}\!\left(\omega_f\right) < \alpha^*_1\leq\alpha^*_2$, and thus

\begin{equation*}v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_2\right) = \min\left\{v = 1, 2, \dots, 25\;:\; p^{\left(v,j\right)}_{\mathcal{S}}\!\left(\omega_f\right)<\alpha^*_2\right\}\leq v_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 25,\end{equation*}

or equivalently $t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_2\right) \leq t_{\mathcal{S}}^{\textrm{p}}\!\left(\omega_f;\;\alpha^*_1\right)\leq 3$ . However, notably, the declining rate of the estimated proportion for exceeding CD is greater than that for exceeding ID.

Similar to Figure 9 and Table 12, one can depict the empirical conditional density functions and list the summary statistics of $t_{\textrm{CD}}^{\textrm{p}}\!\left(\cdot;\;\alpha^*\right)$ and $t_{\textrm{ID}}^{\textrm{p}}\!\left(\cdot;\;\alpha^*\right)$, for each level of significance $\alpha^*$, conditional on RLw/OL exceeding CD with statistical significance within 3 years. For example, with $\alpha^*=0.1$, Figure 10 and Table 14 illustrate that, compared with Figure 9 and Table 12, the distributions are shifted to the right and more spread out, and the summary statistics are all larger.

Table 12. Summary statistics of empirical conditional distributions of first surpassing times conditioning on reinforcement learning agent with online learning phase exceeding correct Delta in terms of sample means of terminal P&L within 3 years.

Table 13. Estimated proportions of future trajectories where reinforcement learning agent with online learning phase is statistically significant to be exceeding correct Delta and incorrect Delta within 3 years with various levels of significance.

Figure 10 Empirical conditional density functions of first statistically significant surpassing times conditioning on reinforcement learning agent with online learning phase being statistically significant to be exceeding correct Delta within 3 years for $0.1$ level of significance.

Finally, to further examine the hedging performance of RLw/OL in terms of the sample mean of the terminal P&L in (20), as well as to take the random future trajectories into account, Figure 11 shows snapshots of the empirical density functions, across the future trajectories, of the sample mean by each hedging strategy over time at $t=0,0.6,1.2,1.8,2.4,\textrm{ and }3$; Table 15 outlines their summary statistics. Note that, at the current time $t=0$, since none of the future trajectories has been realised yet, the empirical density functions are given by Dirac deltas at the corresponding sample mean of each hedging strategy, which depends only on the simulated scenarios. As time progresses, the empirical density function of RLw/OL gradually shifts to the right, substantially passing the one of ID and almost catching up with the one of CD by $t=1.8$. This sheds light on the high probability that RLw/OL is able to self-revise its hedging strategy from a very sub-optimal one to a nearly optimal one close to CD.

Table 14. Summary statistics of empirical conditional distributions of first statistically significant surpassing times conditioning on reinforcement learning agent with online learning phase being statistically significant to be exceeding correct Delta within 3 years for $0.1$ level of significance.

Figure 11 Snapshots of empirical density functions of sample mean of terminal P&L by reinforcement learning agent with online learning phase, reinforcement learning agent without online learning phase, correct Delta, and incorrect Delta at different time points.

7. Methodological Assumptions and Implications in Practice

To apply the proposed two-phase RL approach to a hedging problem of contingent claims, at least four assumptions need to be satisfied. This section discusses these assumptions and elaborates on their implications in practice.

Table 15. Summary statistics of empirical distributions of sample mean of terminal P&L by reinforcement learning agent with online learning phase, reinforcement learning agent without online learning phase, correct Delta, and incorrect Delta at different time points.

7.1. Observable, sufficient, relevant, and transformed features in state

One of the crucial components in an MDP environment of the training phase or the online learning phase is the state, whose features provide information from the environment to the RL agent. First, the features must be observable by the RL agent for learning. For instance, in our proposed state vectors (18) and (19), all four features, namely the segregated account value, the hedging portfolio value, the number of surviving policyholders, and the term to maturity, are observable. Any unobservable, albeit desirable, features cannot be included in the state, such as insider information which could provide better inference on the future value of a risky asset, or the exact health condition of a policyholder. Second, the observable features in the state should be sufficient for the RL agent to learn. For example, due to the dual-risk bearing nature of the contract in this paper, the proposed state vectors (18) and (19) incorporate both financial and actuarial features; also, the third and the fourth features in the state vectors (18) and (19) inform the RL agent to halt its hedging at the terminal time. However, incorporating sufficient observable features in the state does not imply that every observable feature in the environment should be included; the observable features in the state also need to be relevant for efficient learning. Since the segregated account value and the term to maturity have already been included as features in the state vectors (18) and (19), the risky asset value and the hedging time carry similar information from the environment and are thus redundant features for the state. Finally, features in the state which have high variance might be appropriately transformed to reduce the volatility due to exploration. For instance, the segregated account value in the state vectors (18) and (19) is log-transformed in both phases.
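As an illustration of these points, the following is a minimal sketch of assembling such a four-feature state vector at a hedging time, with the segregated account value log-transformed; the function and variable names are illustrative, and the precise state definitions remain those in (18) and (19).

```python
import numpy as np

def make_state(segregated_account_value, hedging_portfolio_value,
               surviving_policyholders, term_to_maturity):
    """Observable and relevant features only; the segregated account value is
    log-transformed to reduce the volatility due to exploration."""
    return np.array([
        np.log(segregated_account_value),   # financial feature, log-transformed
        hedging_portfolio_value,            # hedging portfolio value
        surviving_policyholders,            # actuarial feature
        term_to_maturity,                   # tells the agent when to halt hedging
    ], dtype=float)

# Hypothetical values for one hedging time.
state = make_state(segregated_account_value=105.3, hedging_portfolio_value=12.7,
                   surviving_policyholders=98, term_to_maturity=7.5)
```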

7.2. Reward engineering

Another crucial component in an MDP environment is the reward, which supplies signals for the RL agent to evaluate its actions, i.e. the hedging strategy, for learning. First, the reward signals, if available, should reflect the local hedging performance. For example, in this paper, the RL agent is provided with the sequential anchor-hedging reward, given in (9), in the training phase; through the net liability value in the MDP training environment, the RL agent often receives a positive (resp. negative) signal for encouragement (resp. punishment), which is more informative than collecting a zero reward. However, any informative reward signals need to be computable from the MDP environment. In this paper, since the insurer does not know the MDP market environment, the RL agent cannot be supplied with the sequential anchor-hedging reward signals, which involve the net liability values, in the online learning phase, even though they are more informative; instead, in the online learning phase, the RL agent is given the less informative single terminal reward, given in (8), which can be computed from the market environment.

7.3. Markov property in state and action

In an MDP environment of the training phase or the online learning phase, the state and action pair needs to satisfy the Markov property in (3). In the training phase, since the MDP training environment is constructed, the Markov property can be verified theoretically for the state, with the features included in line with section 7.1, and the action, which is the hedging strategy. For example, in this paper, with the model of the market environment being the BS and the CFM, the state vector in (18) and the Markovian hedging strategy satisfy the Markov property in the training phase. Since the illustrative example in this paper assumes that the market environment also follows the BS and the CFM, the state vector in (19) and the Markovian hedging strategy satisfy the Markov property in the online learning phase as well. However, in general, as the market environment is unknown, the Markov property for the state and action pair would need to be checked statistically in the online learning phase, as follows.

After the training phase and before an RL agent proceeds to the online learning phase, historical state and action sequences over a time frame are derived by hypothetically writing identical contingent claims and using the historical realisations from the market environment. For instance, historical values of risky assets are publicly available, or an insurer retrieves the historical survival status of its policyholders with demographic information and medical history similar to those of the policyholder actually being written. These historical samples of the state and action pair are then used to conduct hypothesis testing on whether the Markov property in (3) holds for the pair in the market environment, by, for example, the test statistics proposed in Chen & Hong (2012). If the Markov property holds statistically, the RL agent can begin the online learning phase. Yet, if the property does not hold statistically, the state and action pair should be revised and the training phase should then be revisited; since the hedging strategy is the action in a hedging problem, only the state can be amended, by including more features from the environment. Moreover, during the online learning phase, right after each further update step, new historical state and action sequences over a shifted time frame of the same duration are obtained, together with the most recent historical realisations from the market environment and the action samples drawn from the updated policy. These regularly refreshed samples should be used to statistically verify the Markov property on a rolling basis. If the property fails to hold at any time, the state needs to be revised and the RL agent must be re-trained before resuming the online learning.
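As a simple illustration of such a rolling check, the sketch below compares one-lag and two-lag linear autoregressions of a single state feature via an F-test; this is only a crude univariate proxy for a first-order Markov check, not the characteristic-function-based test of Chen & Hong (2012) referenced above, and the data path is hypothetical.

```python
import numpy as np
from scipy import stats

def _lagged(x, lags):
    """Regressors [1, x_{k-1}, ..., x_{k-lags}] and targets x_k."""
    cols = [x[lags - l - 1:len(x) - l - 1] for l in range(lags)]
    return np.column_stack([np.ones(len(cols[0]))] + cols), x[lags:]

def extra_lag_f_test(x):
    """p-value of an F-test for whether a second lag adds predictive power beyond
    the first; a small p-value casts doubt on a first-order Markov structure."""
    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return ((y - X @ beta) ** 2).sum()
    X1, y1 = _lagged(x, 1)
    X2, y2 = _lagged(x, 2)
    rss1, rss2 = rss(X1[1:], y1[1:]), rss(X2, y2)   # align both models on x_2, x_3, ...
    dof = len(y2) - X2.shape[1]
    f_stat = (rss1 - rss2) / (rss2 / dof)
    return stats.f.sf(f_stat, 1, dof)

rng = np.random.default_rng(0)
log_account = np.cumsum(rng.normal(size=500))   # hypothetical historical feature path
print(extra_lag_f_test(log_account))            # large p-value: no evidence against Markov
```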

7.4. Re-establishment of contingent claims in online learning phase

Any contingent claim must have a finite terminal time realisation. On the one hand, in the training phase, that is the time when an episode ends and the state is re-initialised, so that the RL agent can be trained in the training environment for as long as possible. On the other hand, in the online learning phase, the market environment, and hence the state, cannot be re-initialised; instead, at each terminal time realisation, the seller re-establishes identical contingent claims with the same contract characteristics and written on (more or less) the same assets, so that the RL agent can be trained in the market environment successively. In this paper, the terms to maturity and the minimum guarantees of all variable annuity contracts in the online learning phase are the same. Moreover, all re-established contracts therein are written on the same financial risky asset, though the initial values of the asset are given by the real-time realisations in the market environment. Finally, while a new policyholder is written at each contract inception time, these policyholders have similar, if not identical, distributions of their random future lifetimes, as judged by their demographic information and medical history.

8. Concluding Remarks and Future Directions

This paper proposed the two-phase deep RL approach which can tackle the practically common issue of model miscalibration when hedging variable annuity contracts with both GMMB and GMDB riders in the BS financial and CFM actuarial market environments. The approach is composed of the training phase and the online learning phase. While the satisfactory hedging performance of the trained RL agent in the training environment was anticipated, the performance of the further trained RL agent in the market environment, via the illustrative example, should be highlighted. First, by comparing their sample means of terminal P&L from simulated scenarios, in most future trajectories and within a reasonable amount of time, the further trained RL agent was able to exceed the hedging performance of both the correct Delta from the market environment and the incorrect Delta from the training environment. Second, through a more delicate hypothesis testing analysis, similar conclusions can be drawn for a fair number of future trajectories. Finally, snapshots of the empirical density functions, across the future trajectories, of the sample means of terminal P&L from simulated scenarios by each hedging strategy shed light on the high probability that the further trained RL agent is indeed able to self-revise its hedging strategy.

At least two future directions derive from this paper. (I) The market environment in the illustrative example of this paper was assumed to follow the BS financial and CFM actuarial models, which turned out to be the same as those designed by the insurer for the training environment, albeit with different parameters. Moreover, the policyholders were assumed to be homogeneous, in that their survival probabilities and investment behaviours are all the same, and they even held identical contracts with the same minimum guarantee and maturity. In the market environment, the agent only had to hedge one contract at a time, instead of a portfolio of contracts. Obviously, if any of these assumptions is relaxed, the RL agent trained in the current training environment should not be expected to produce satisfactory hedging performance in a market environment. Therefore, the training environment will certainly need to be substantially extended in terms of its sophistication, in order for the trained RL agent to be able to further learn and hedge well in realistic market environments. (II) Beyond this, an even more ambitious question to be addressed is how similar the training and market environments have to be for the online learning for self-revision of the hedging strategy to be possible, if not efficient. This second future direction is related to transfer learning adapted to the variable annuity hedging problem and shall be investigated carefully in the future.

Appendix A. Deep Hedging Approach

In this section, we provide a brief review of the DH approach, adapted from Bühler et al. (2019). In particular, the hedging objective of the insurer is still given by $\sqrt{\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]}$, with Equation (2) being the optimal (discrete) hedging strategy. The hedging agent built by the insurer using the DH algorithm is called the DH agent hereafter.

A.1. Deterministic Action

Different from section 3.1, in which the RL agent takes a stochastic action sampled from the policy for exploration in the MDP environment, the DH agent deploys only a deterministic action $H^{\textrm{DH}}\;:\; \mathcal{X} \to \mathcal{A}$, which is a direct mapping from the state space to the action space. Specifically, at each time $t_k$, where $k = 0,1, \dots, \tilde{n}-1$, given the current state $X_{t_k} \in \mathcal{X}$, the DH agent takes the action $H^{\textrm{DH}}\!\left(X_{t_k}\right) \in \mathcal{A}$. In this case, the objective of the DH agent is to solve for the optimal hedging strategy $H^{\textrm{DH},*}\!\left(\cdot\right)$ that minimises $\sqrt{\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]}$, or equivalently $\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}-L_{t_{\tilde{n}}}\right)^2\right]$.

A.2. Action Approximation and Parameterisation

The deterministic action mapping $H^{\textrm{DH}}\;:\; \mathcal{X} \to \mathcal{A}$ is then approximated and parameterised by an ANN with weights $\upsilon_{\textrm{a}}$. The construction of such an ANN $\mathcal{N}_{\textrm{a}}\!\left(\cdot;\;\upsilon_{\textrm{a}}\right)$ is similar to that in section 3.3.1, except that $\mathcal{N}_{\textrm{a}}\!\left(x;\;\upsilon_{\textrm{a}}\right) \in \mathbb{R}$ for any $x \in \mathbb{R}^{p}$; that is, $\mathcal{N}_{\textrm{a}}\!\left(\cdot;\;\upsilon_{\textrm{a}}\right)$ takes a state vector $x\in \mathbb{R}^p$ as the input and directly outputs a deterministic action $a\left(x;\;\upsilon_{\textrm{a}}\right) \in \mathbb{R}$, instead of the Gaussian mean-variance tuple $\left(c\!\left(x;\;\upsilon_{\textrm{a}}\right),d^2\left(x;\;\upsilon_{\textrm{a}}\right)\right) \in \mathbb{R} \times \mathbb{R}^+$ in the RL approach, from which an action is then sampled under the Gaussian measure. Hence, in the DH approach, solving for the optimal hedging strategy $H^{\textrm{DH},*}\!\left(\cdot\right)$ boils down to finding the optimal weights $\upsilon_{\textrm{a}}^*$.

A.3. Deep Hedging Method

The DH agent starts from initial ANN weights $\upsilon_{\textrm{a}}^{\left(0\right)}$ , deploys the hedging strategy to collect terminal P&Ls, and gradually updates the ANN weights by stochastic gradient ascent as shown in Equation (13), with $\theta$ replaced by $\upsilon$ . For the DH agent, at each update step $u = 1, 2, \dots$ , the surrogate performance measure is given as

\begin{align*} \mathcal{J}^{\left(u-1\right)}\!\left(\upsilon_{\textrm{a}}^{\left(u-1\right)}\right) = -\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}^{\left(u-1\right)}-L_{t_{\tilde{n}}}^{\left(u-1\right)}\right)^2\right]. \end{align*}

Correspondingly, the gradient of the surrogate performance measure with respect to the ANN weights $\upsilon_{\textrm{a}}$ is

\begin{align*} \nabla_{\upsilon_{\textrm{a}}}\mathcal{J}^{\left(u-1\right)}\!\left(\upsilon_{\textrm{a}}^{\left(u-1\right)}\right) = -2\mathbb{E}\!\left[\left(P_{t_{\tilde{n}}}^{\left(u-1\right)}-L_{t_{\tilde{n}}}^{\left(u-1\right)}\right)\nabla_{\upsilon_{\textrm{a}}}P_{t_{\tilde{n}}}^{\left(u-1\right)}\right]. \end{align*}

Therefore, based on the realised terminal P&L $p_{t_{\tilde{n}}}^{\left(u-1\right)}$ and $l_{t_{\tilde{n}}}^{\left(u-1\right)}$ , the estimated gradient is given as

\begin{align*} \widehat{\nabla_{\upsilon_{\textrm{a}}}\mathcal{J}^{\left(u-1\right)}\!\left(\upsilon_{\textrm{a}}^{\left(u-1\right)}\right)} = -2\left(p_{t_{\tilde{n}}}^{\left(u-1\right)} - l_{t_{\tilde{n}}}^{\left(u-1\right)}\right)\nabla_{\upsilon_{\textrm{a}}}p_{t_{\tilde{n}}}^{\left(u-1\right)}. \end{align*}

Algorithm 1 summarises the DH method above.

Algorithm 1. Pseudo-code for deep hedging method

Compared with the policy gradient methods introduced in section 3.4, the DH method shows two key differences. First, it assumes that the hedging portfolio value $P_{t_{\tilde{n}}}^{\left(u-1\right)}$ is differentiable with respect to $\upsilon_{\textrm{a}}$ at each update $u = 1, 2,\dots$. Second, the update of the ANN weights does not depend on intermediate rewards collected during an episode; that is, to update the weights, the DH agent has to experience a complete episode to realise the terminal P&L. Therefore, the update frequency of the DH method is lower than that of an RL method with the TD feature.
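To make the update concrete, the following is a minimal PyTorch sketch of one DH update step, under the assumptions that the state has the four features of (18) and that a differentiable episode simulator is available; the network sizes, the `toy_simulator` stand-in, and all variable names are illustrative and are not the hyperparameters of Table C.1 or the paper's code.

```python
import torch
import torch.nn as nn

# Hypothetical deterministic-action network N_a(.; upsilon_a): four-feature state -> hedge.
dh_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(),
                       nn.Linear(32, 1))
optimizer = torch.optim.Adam(dh_net.parameters(), lr=1e-3)

def toy_simulator(net, batch=64):
    """Toy stand-in for a differentiable episode simulator: returns P_T - L_T as a
    tensor that is connected to the network weights through the actions."""
    states = torch.randn(batch, 4)
    actions = net(states).squeeze(-1)
    price_moves = torch.randn(batch)
    return actions * price_moves - 0.1          # placeholder terminal P&L

def dh_update(simulate_terminal_pnl):
    """One DH update: maximise J = -E[(P - L)^2] by gradient descent on the mean
    squared terminal P&L, using the pathwise gradient through the simulator."""
    pnl = simulate_terminal_pnl(dh_net)
    loss = (pnl ** 2).mean()
    optimizer.zero_grad()
    loss.backward()                             # requires P_T differentiable in the weights
    optimizer.step()
    return loss.item()

dh_update(toy_simulator)
```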

Appendix B. REINFORCE: A Monte Carlo Policy Gradient Method

At each update step $u=1,2,\dots$ , based on the ANN weights $\theta^{\left(u-1\right)}$ , and thus the policy $\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$ , the RL agent experiences the realised episode:

\begin{equation*}\left\{x_{t_0}^{\left(u-1\right)},h_{t_0}^{\left(u-1\right)},x_{t_1}^{\left(u-1\right)},r_{t_1}^{\left(u-1\right)},h_{t_1}^{\left(u-1\right)},\dots,x_{t_{\tilde{n}-1}}^{\left(u-1\right)},r_{t_{\tilde{n}-1}}^{\left(u-1\right)},h_{t_{\tilde{n}-1}}^{\left(u-1\right)},x_{t_{\tilde{n}}}^{\left(u-1\right)},r_{t_{\tilde{n}}}^{\left(u-1\right)}\right\},\end{equation*}

where $h_{t_k}^{\left(u-1\right)}$ , for $k=0,1,\dots,\tilde{n}-1$ , is the time $t_k$ realised hedging strategy being sampled from the Gaussian distribution with the mean $c\!\left(x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$ and the variance $d^2\left(x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$ . In the following, fix an update step $u=1,2,\dots$ .

REINFORCE takes directly the time-0 value function $V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}\right)$ , for any $x\in\mathcal{X}$ , as a part of the surrogate performance measure:

\begin{equation*}V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=0}^{\tilde{n}-1}R_{t_{k+1}}^{\left(u-1\right)}\Big\vert X^{\left(u-1\right)}_{0}=x\right].\end{equation*}

In Williams (1992), the Policy Gradient Theorem was proved, which states that

\begin{equation*}\nabla_{\theta_{\textrm{p}}}V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}\right)=\mathbb{E}\!\left[\sum_{k=0}^{\tilde{n}-1}\!\left(\sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}^{\left(u-1\right)}\right)\nabla_{\theta_{\textrm{p}}}\ln\phi\left(H^{\left(u-1\right)}_{t_k};\;X^{\left(u-1\right)}_{t_k},\theta_{\textrm{p}}\right)\Big\vert X^{\left(u-1\right)}_{0}=x\right],\end{equation*}

where $\phi\left(\cdot;\;X^{\left(u-1\right)}_{t_k},\theta_{\textrm{p}}\right)$ is the Gaussian density function with mean $c\!\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)$ and variance $d^2\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)$ . Therefore, based on the realised episode, the estimated gradient of the time-0 value function is given by

\begin{equation*}\widehat{\nabla_{\theta_{\textrm{p}}}V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)}=\sum_{k=0}^{\tilde{n}-1}\!\left(\sum_{l=k}^{\tilde{n}-1}r_{t_{l+1}}^{\left(u-1\right)}\right)\nabla_{\theta_{\textrm{p}}}\ln\phi\left(h_{t_k}^{\left(u-1\right)};\;x_{t_k}^{\left(u-1\right)},\theta_{\textrm{p}}^{\left(u-1\right)}\right).\end{equation*}

Notice that, thanks to the Policy Gradient Theorem, the gradient of the surrogate performance measure does not depend on the gradient of the reward function; hence, the reward function can be discrete or non-differentiable, and the estimated gradient of the surrogate performance measure only requires the numerical reward values. In contrast, in the DH approach of Bühler et al. (2019), the gradient of the surrogate performance measure does depend on the gradient of the terminal loss function; that approach thus implicitly requires the differentiability of the hedging portfolio value, and the estimated gradient of the surrogate performance measure requires its numerical gradient values. See Appendix A for more details.

To reduce the variance of the estimated gradient above, Williams (1992) suggested introducing an unbiased baseline into this gradient, a natural choice of which is the value function:

\begin{align*}\nabla_{\theta_{\textrm{p}}}V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}\right) &=\mathbb{E}\!\left[\sum_{k=0}^{\tilde{n}-1}\!\left(\sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}^{\left(u-1\right)}-V\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)\right)\right. \\[5pt] & \quad \left. \nabla_{\theta_{\textrm{p}}}\ln\phi\left(H^{\left(u-1\right)}_{t_k};\;X^{\left(u-1\right)}_{t_k},\theta_{\textrm{p}}\right)\Big\vert X^{\left(u-1\right)}_{0}=x\right];\end{align*}

see also Weaver & Tao (2001). Herein, at any time $t_k$, for $k=0,1,\dots,\tilde{n}-1$, $A^{\left(u-1\right)}_{t_k}=\sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}^{\left(u-1\right)}-V\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{p}}\right)$ is called an advantage. Since the true value function is unknown to the RL agent, it is approximated by $\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)=\mathcal{N}_{\textrm{v}}\!\left(X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)$, defined in (12), in which the ANN weights are evaluated at $\theta_{\textrm{v}}=\theta_{\textrm{v}}^{\left(u-1\right)}$, as the gradient of the time-0 value function is independent of the ANN weights $\theta_{\textrm{v}}$; hence, the estimated advantage is given by $\hat{A}^{\left(u-1\right)}_{t_k}=\sum_{l=k}^{\tilde{n}-1}R_{t_{l+1}}^{\left(u-1\right)}-\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)$.

Due to the value function approximation in the baseline, REINFORCE includes a second component in the surrogate performance measure, which aims to minimise the loss between the sum of reward signals and the approximated value function by the ANN. Therefore, the surrogate performance measure is given by:

\begin{align*}\mathcal{J}^{\left(u-1\right)}\!\left(\theta\right)&=V^{\left(u-1\right)}\!\left(0,x;\;\theta_{\textrm{p}}\right)-\mathbb{E}\!\left[\sum_{k=0}^{\tilde{n}-1}\!\left(\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}+\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right) \right.\right. \\[5pt] & \left.\left. \quad -\hat{V}\!\left(t_k,X^{\left(u-1\right)}_{t_k};\;\theta_{\textrm{v}}\right)\right)^2\Big\vert X^{\left(u-1\right)}_{0}=x\right],\end{align*}

where the estimated advantage $\hat{A}^{\left(u-1\right)}_{\theta^{\left(u-1\right)}_{\textrm{p}},t_k}$ is evaluated at $\theta_{\textrm{p}}=\theta^{\left(u-1\right)}_{\textrm{p}}$.

Hence, at each update step $u=1,2,\dots$ , based on the ANN weights $\theta^{\left(u-1\right)}$ , and thus, the policy $\pi\!\left(\cdot;\;\theta_{\textrm{p}}^{\left(u-1\right)}\right)$ , the estimated gradient of the surrogate performance measure is given by

\begin{align*}\widehat{\nabla_{\theta}\mathcal{J}^{\left(u-1\right)}\!\left(\theta^{\left(u-1\right)}\right)}=&\;\sum_{k=0}^{\tilde{n}-1}\!\left(\sum_{l=k}^{\tilde{n}-1}r_{t_{l+1}}^{\left(u-1\right)}-\hat{V}\!\left(t_k,x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)\right)\nabla_{\theta_{\textrm{p}}}\ln\phi\left(h_{t_k}^{\left(u-1\right)};\;x_{t_k}^{\left(u-1\right)},\theta_{\textrm{p}}^{\left(u-1\right)}\right)\\[5pt] &\;+\sum_{k=0}^{\tilde{n}-1}\!\left(\sum_{l=k}^{\tilde{n}-1}r_{t_{l+1}}^{\left(u-1\right)}-\hat{V}\!\left(t_k,x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)\right)\nabla_{\theta_{\textrm{v}}}\hat{V}\!\left(t_k,x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)\\[5pt] =&\;\sum_{k=0}^{\tilde{n}-1}\hat{a}^{\left(u-1\right)}_{t_k}\!\left(\nabla_{\theta_{\textrm{p}}}\ln\phi\left(h_{t_k}^{\left(u-1\right)};\;x_{t_k}^{\left(u-1\right)},\theta_{\textrm{p}}^{\left(u-1\right)}\right)+\nabla_{\theta_{\textrm{v}}}\hat{V}\!\left(t_k,x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)\right),\end{align*}

where $\hat{a}_{t_k}^{\left(u-1\right)}=\sum_{l=k}^{\tilde{n}-1}r_{t_{l+1}}^{\left(u-1\right)}-\hat{V}\!\left(t_k,x_{t_k}^{\left(u-1\right)};\;\theta_{\textrm{v}}^{\left(u-1\right)}\right)$ , for $k=0,1,\dots,\tilde{n}-1$ , is the realised estimated advantage.
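The estimated gradient above corresponds, up to a constant scaling of the value-function term, to a standard REINFORCE-with-baseline update. The following is a minimal PyTorch sketch with illustrative network sizes and a log-standard-deviation parameterisation of the Gaussian policy, which differs in detail from the paper's policy ANN of section 3.3.1; the episode tensors are placeholders rather than simulator output.

```python
import torch
import torch.nn as nn

# Hypothetical networks for a four-feature state; the policy outputs the Gaussian
# mean and log-standard-deviation of the hedging action.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
value_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(list(policy_net.parameters()) + list(value_net.parameters()),
                             lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE-with-baseline update from a realised episode.
    states: (n, 4), actions: (n,), rewards: (n,) tensor of r_{t_{k+1}}, k = 0, ..., n-1."""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # rewards-to-go
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()                       # estimated advantages a_hat
    mean, log_std = policy_net(states).unbind(dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    policy_loss = -(advantages * dist.log_prob(actions)).sum()   # policy-gradient term
    value_loss = ((returns - values) ** 2).sum()                 # baseline regression term
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()

# Toy episode with n = 10 hedging times (placeholders only).
reinforce_update(torch.randn(10, 4), torch.randn(10), torch.randn(10))
```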

Appendix C. Deep Hedging Training

The state vector observed by the DH agent is the same as that observed by the RL agent in Equation (18). Table C.1(a) summarises the hyperparameters of the DH agent training, while Table C.1(b) outlines the hyperparameters of the ANN architecture of the DH agent; see Appendix A.

Table C.1. The hyperparameters of deep hedging training and the neural network.

Footnotes

This work was first initiated by the authors at the Illinois Risk Lab in January 2020. This work was presented at the 2020 Actuarial Research Conference in August 2020, the United As One: 24th International Congress on Insurance: Mathematics and Economics in July 2021, the 2021 Actuarial Research Conference in August 2021, Heriot-Watt University in November 2021, University of Amsterdam in June 2022, and the 2022 Insurance Data Science Conference in June 2022. The authors thank the participants for fruitful comments. This work utilises resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign. The authors are grateful to anonymous reviewers for their careful reading and insightful comments. The programming code is publicly available on GitHub at the following link: https://github.com/yuxuanli-lyx/gmmb_gmdb_rl_hedging.

1 Note that a “state” in this paper, in line with the terminologies of Markov decision processes, refers to observable metrics from the environment that serve as a proxy for a true state with unobservable but desirable features. See sections 2.3.2 and 7.1 for more details.

2 The grid search was performed using the Hardware-Accelerated Learning cluster in the National Center for Supercomputing Applications; see Kindratenko et al. (2020).

References

Baydin, A.G., Pearlmutter, B.A., Radul, A.A. & Siskind, J.M. (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(2), 1–43.
Bertsimas, D., Kogan, L. & Lo, A.W. (2000). When is time continuous? Journal of Financial Economics, 55(2), 173–204.
Bühler, H., Gonon, L., Teichmann, J. & Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8), 1271–1291.
Cao, J., Chen, J., Hull, J. & Poulos, Z. (2021). Deep hedging of derivatives using reinforcement learning. Journal of Financial Data Science, 3(1), 10–27.
Carbonneau, A. (2021). Deep hedging of long-term financial derivatives. Insurance: Mathematics and Economics, 99, 327–340.
Charpentier, A., Élie, R. & Remlinger, C. (2021). Reinforcement learning in economics and finance. Computational Economics.
Chen, B. & Hong, Y. (2012). Testing for the Markov property in time series. Econometric Theory, 28, 130–178.
Chen, Z., Vetzal, K. & Forsyth, P. (2008). The effect of modelling parameters on the value of GMWB guarantees. Insurance: Mathematics and Economics, 43(1), 165–173.
Cheridito, P., Ery, J. & Wüthrich, M.V. (2020). Assessing asset-liability risk with neural networks. Risks, 8(1), article 16.
Chong, W.F. (2019). Pricing and hedging equity-linked life insurance contracts beyond the classical paradigm: the principle of equivalent forward preferences. Insurance: Mathematics and Economics, 88, 93–107.
Cui, Z., Feng, R. & MacKay, A. (2017). Variable annuities with VIX-linked fee structure under a Heston-type stochastic volatility model. North American Actuarial Journal, 21(3), 458–483.
Dai, M., Kwok, Y.K. & Zong, J. (2008). Guaranteed minimum withdrawal benefit in variable annuities. Mathematical Finance, 18(4), 595–611.
Dang, O., Feng, M. & Hardy, M.R. (2020). Efficient nested simulation for conditional tail expectation of variable annuities. North American Actuarial Journal, 24(2), 187–210.
Dang, O., Feng, M. & Hardy, M.R. (2022). Dynamic importance allocated nested simulation for variable annuity risk measurement. Annals of Actuarial Science, 16(2), 319–348.
Feng, B.M., Tan, Z. & Zheng, J. (2020). Efficient simulation designs for valuation of large variable annuity portfolios. North American Actuarial Journal, 24(2), 275–289.
Feng, R. (2018). An Introduction to Computational Risk Management of Equity-Linked Insurance. CRC Press, Boca Raton, Florida, U.S.
Feng, R. & Yi, B. (2019). Quantitative modeling of risk management strategies: stochastic reserving and hedging of variable annuity guaranteed benefits. Insurance: Mathematics and Economics, 85, 60–73.
Gan, G. (2013). Application of data clustering and machine learning in variable annuity valuation. Insurance: Mathematics and Economics, 53(3), 795–801.
Gan, G. (2018). Valuation of large variable annuity portfolios using linear models with interactions. Risks, 6(3), 119.
Gan, G. & Lin, X.S. (2015). Valuation of large variable annuity portfolios under nested simulation: a functional data approach. Insurance: Mathematics and Economics, 62, 138–150.
Gan, G. & Lin, X.S. (2017). Efficient Greek calculation of variable annuity portfolios for dynamic hedging: a two-level metamodeling approach. North American Actuarial Journal, 21(2), 161–177.
Gan, G. & Valdez, E.A. (2017). Modeling partial Greeks of variable annuities with dependence. Insurance: Mathematics and Economics, 76, 118–134.
Gan, G. & Valdez, E.A. (2018). Regression modeling for the valuation of large variable annuity portfolios. North American Actuarial Journal, 22(1), 40–54.
Gan, G. & Valdez, E.A. (2020). Valuation of large variable annuity portfolios with rank order kriging. North American Actuarial Journal, 24(1), 100–117.
Gao, G. & Wüthrich, M.V. (2019). Convolutional neural network classification of telematics car driving data. Risks, 7(1), article 6.
Gweon, H., Li, S. & Mamon, R. (2020). An effective bias-corrected bagging method for the valuation of large variable annuity portfolios. ASTIN Bulletin: The Journal of the International Actuarial Association, 50(3), 853–871.
Hardy, M. (2003). Investment Guarantees: Modeling and Risk Management for Equity-Linked Life Insurance. John Wiley & Sons, Inc., Hoboken, New Jersey, U.S.
Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems, vol. 23.
Hejazi, S.A. & Jackson, K.R. (2016). A neural network approach to efficient valuation of large portfolios of variable annuities. Insurance: Mathematics and Economics, 70, 169–181.
Hu, C., Quan, Z. & Chong, W.F. (2022). Imbalanced learning for insurance using modified loss functions in tree-based models. Insurance: Mathematics and Economics, 106, 13–32.
Jeon, J. & Kwak, M. (2018). Optimal surrender strategies and valuations of path-dependent guarantees in variable annuities. Insurance: Mathematics and Economics, 83, 93–109.
Kindratenko, V., Mu, D., Zhan, Y., Maloney, J., Hashemi, S.H., Rabe, B., Xu, K., Campbell, R., Peng, J. & Gropp, W. (2020). HAL: computer system for scalable deep learning. In Practice and Experience in Advanced Research Computing (PEARC’20) (pp. 41–48).
Kolm, P.N. & Ritter, G. (2019). Dynamic replication and hedging: a reinforcement learning approach. Journal of Financial Data Science, 1(1), 159–171.
Lin, X.S. & Yang, S. (2020). Fast and efficient nested simulation for large variable annuity portfolios: a surrogate modeling approach. Insurance: Mathematics and Economics, 91, 85–103.
Liu, K. & Tan, K.S. (2020). Real-time valuation of large variable annuity portfolios: a green mesh approach. North American Actuarial Journal, 25(3), 313–333.
Milevsky, M.A. & Posner, S.E. (2001). The Titanic option: valuation of the guaranteed minimum death benefit in variable annuities and mutual funds. The Journal of Risk and Insurance, 68(1), 93–128.
Milevsky, M.A. & Salisbury, T.S. (2006). Financial valuation of guaranteed minimum withdrawal benefits. Insurance: Mathematics and Economics, 38(1), 21–38.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv: 1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
Moenig, T. (2021a). Efficient valuation of variable annuity portfolios with dynamic programming. Journal of Risk and Insurance, 88(4), 1023–1055.
Moenig, T. (2021b). Variable annuities: market incompleteness and policyholder behavior. Insurance: Mathematics and Economics, 99, 63–78.
Perla, F., Richman, R., Scognamiglio, S. & Wüthrich, M.V. (2021). Time-series forecasting of mortality rates using deep learning. Scandinavian Actuarial Journal, 7, 572–598.
Quan, Z., Gan, G. & Valdez, E. (2021). Tree-based models for variable annuity valuation: parameter tuning and empirical analysis. Annals of Actuarial Science, 16(1), 95–118.
Richman, R. & Wüthrich, M.V. (2021). A neural network extension of the Lee-Carter model to multiple populations. Annals of Actuarial Science, 15(2), 346–366.
Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. (2015). Trust region policy optimization. arXiv: 1502.05477.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv: 1707.06347.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van den Driessche, G., Graepel, T. & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354–359.
Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, U.S.
Trottier, D.A., Godin, F. & Hamel, E. (2018). Local hedging of variable annuities in the presence of basis risk. ASTIN Bulletin: The Journal of the International Actuarial Association, 48(2), 611–646.
Wang, G. & Zou, B. (2021). Optimal fee structure of variable annuities. Insurance: Mathematics and Economics, 101, 587–601.
Wang, H., Zariphopoulou, T. & Zhou, X. (2020). Reinforcement learning in continuous time and space: a stochastic control approach. Journal of Machine Learning Research, 21, 1–34.
Wang, H. & Zhou, X. (2020). Continuous-time mean-variance portfolio selection: a reinforcement learning framework. Mathematical Finance, 30(4), 1273–1308.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
Watkins, C.J.C.H. & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Weaver, L. & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In UAI’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (pp. 538–545).
Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Wüthrich, M.V. (2018). Neural networks applied to chain-ladder reserving. European Actuarial Journal, 8, 407–436.
Xu, W., Chen, Y., Coleman, C. & Coleman, T.F. (2018). Moment matching machine learning methods for risk management of large variable annuity portfolios. Journal of Economic Dynamics and Control, 87, 1–20.
Xu, X. (2020). Variable Annuity Guaranteed Benefits: An Integrated Study of Financial Modelling, Actuarial Valuation and Deep Learning. PhD thesis, UNSW Business School.