Introduction
Shannon’s theory of communication (Shannon, Reference Shannon1948a; Reference Shannon1948b) demonstrated that the information transmitted between systems is a well-defined measurable quantity with fundamental limits. The two key elements of what became known as information theory are entropy and mutual information. The entropy of a system quantifies the uncertainty of the result of making an observation on a signal, and the mutual information quantifies how much of that uncertainty can be reduced by a related signal. From a statistical viewpoint, mutual information is a measure of the probabilistic dependence between univariate or multivariate random variables that is more general than, for example, Pearson’s correlation, which measures the linear association between two univariate random variables.
Given the wide applicability of information-theoretic quantities in physics, engineering, and the life sciences, it is important to have available accurate numerical estimators. Unfortunately, the underlying distributions are generally unknown, and nonparametric estimators are usually required, although they may be subject to error, especially with continuous multivariate random variables where the entropy becomes the differential entropy. For some existing estimators, their slow convergence with increasing sample size can be a serious challenge. But their accuracy may be improved by exploiting a simple decomposition of differential entropy. The method was introduced in Marín-Franch and Foster (Reference Marín-Franch and Foster2013) in an application to artificial image transformations.
The objective here is to illustrate an extension of the method to two real-world datasets and to describe software packages for estimating both differential entropy and mutual information in several computing environments. The two datasets consist of trivariate data from successive scene color images and univariate data from a stereophonic music recording. The first application is more detailed and contains illustrative code for the R computing environment (R Core Team, 2021).
Methods
 The nonparametric estimator described here is an offset version of a nearest-neighbor class of estimators for differential entropy (Berrett et al., Reference Berrett, Samworth and Yuan2019; Charzyńska & Gambin, Reference Charzyńska and Gambin2015; Goria et al., Reference Goria, Leonenko, Mergel and Novi Inverardi2005; Holmes & Nemenman, Reference Holmes and Nemenman2019; Kozachenko & Leonenko, Reference Kozachenko and Leonenko1987; Kraskov et al., Reference Kraskov, Stögbauer and Grassberger2004), namely the Kozachenko–Leonenko (KL) estimator (Goria et al., Reference Goria, Leonenko, Mergel and Novi Inverardi2005; Kozachenko & Leonenko, Reference Kozachenko and Leonenko1987). Although the offset method entails a decomposition into Gaussian and non-Gaussian components, the distribution of 
 $ X $
 is not itself assumed to be Gaussian or approximately Gaussian. Estimating the differential entropy
$ X $
 is not itself assumed to be Gaussian or approximately Gaussian. Estimating the differential entropy 
 $ h(X) $
 of a d-dimensional multivariate continuous random variable
$ h(X) $
 of a d-dimensional multivariate continuous random variable 
 $ X $
 proceeds as follows.
$ X $
 proceeds as follows.
- 
1. Estimate the differential entropy  $ {\hat{h}}_{\mathrm{G}}(X) $
 of a Gaussian distribution with the same covariance matrix, $ {\hat{h}}_{\mathrm{G}}(X) $
 of a Gaussian distribution with the same covariance matrix, $ c $
 say, as $ c $
 say, as $ X $
. $ X $
.
- 
2. Linearly transform  $ X $
 to form a new variable $ X $
 to form a new variable $ {X}^{\ast } $
 whose differential entropy $ {X}^{\ast } $
 whose differential entropy $ h\left({X}^{\ast}\right) $
 is such that $ h\left({X}^{\ast}\right) $
 is such that $ h(X)=h\left({X}^{\ast}\right)+{\hat{h}}_{\mathrm{G}}(X) $
. From the scaling property of differential entropy, the transformation, $ h(X)=h\left({X}^{\ast}\right)+{\hat{h}}_{\mathrm{G}}(X) $
. From the scaling property of differential entropy, the transformation, $ A $
 say, required for the equality to hold is given by $ A $
 say, required for the equality to hold is given by $ A={\left(2\pi e\right)}^{-1/2}{c}^{-1/2} $
, so that $ A={\left(2\pi e\right)}^{-1/2}{c}^{-1/2} $
, so that $ {X}^{\ast }= AX $
. $ {X}^{\ast }= AX $
.
- 
3. Apply the KL estimator to  $ {X}^{\ast } $
 and call the result $ {X}^{\ast } $
 and call the result $ {\hat{h}}_{\mathrm{KL}}\left({X}^{\ast}\right) $
. $ {\hat{h}}_{\mathrm{KL}}\left({X}^{\ast}\right) $
.
- 
4. Define the offset Kozachenko–Leonenko (KLo) estimator by  $ {\hat{h}}_{\mathrm{KL}\mathrm{o}}(X)={\hat{h}}_{\mathrm{KL}}\left({X}^{\ast}\right)+{\hat{h}}_{\mathrm{G}}(X) $
. $ {\hat{h}}_{\mathrm{KL}\mathrm{o}}(X)={\hat{h}}_{\mathrm{KL}}\left({X}^{\ast}\right)+{\hat{h}}_{\mathrm{G}}(X) $
.
Since the KL estimator is asymptotically unbiased (Goria et al., Reference Goria, Leonenko, Mergel and Novi Inverardi2005; Kozachenko & Leonenko, Reference Kozachenko and Leonenko1987), it follows that the offset version KLo is also asymptotically unbiased. For a proof, see Appendix A in the Supplementary Material, available on the Cambridge Core website. For a description of the software implementation, see Appendix B in the Supplementary Material.
Results
The first example illustrates the sample-size dependence of the KLo and KL estimators for a trivariate dataset, along with results for the popular Kraskov-Stögbauer-Grassberger (KSG) estimator of mutual information (Kraskov, Stögbauer, & Grassberger, Reference Kraskov, Stögbauer and Grassberger2004). Progressively larger samples, ranging from 210 to 219 points, were drawn randomly and identically from two trivariate images of a scene recorded at successive instants, about 1 min apart, shown in the thumbnail color images in Figure 1. The data were taken from a larger study (Foster, Reference Foster2021) where image values were expressed not as conventional RGB triplets but as LMS triplets, corresponding to activities in the long-, medium-, and short-wavelength-sensitive cone photoreceptors of the eye. The difference between RGB and LMS representations is immaterial for this illustration. Each image was stored as a 1,024 × 1,344 × 3 array, where the first two dimensions index pixel coordinates and the third dimension indexes LMS values, each obtained by integrating 12-bit spectral radiance data weighted by photoreceptor sensitivities. The frequency distributions of the LMS values were bimodal.

Figure 1. Estimates of mutual information between two color images. The thumbnail images are sRGB renderings (IEC, 1998) of the source data. The plots show mutual information estimates for the offset Kozachenko–Leonenko (KLo), Kozachenko–Leonenko (KL), and Kraskov-Stögbauer-Grassberger (KSG) estimators as a function of sample size. Standard deviations for the KLo and KL estimates ranged from about 0.1 with the smallest sample sizes to 0.006 with the largest sample sizes. Standard deviations for the KSG estimates were a little smaller.
The main panel in Figure 1 shows the KLo, KL, and KSG estimates of the mutual information plotted against the sample size. Each curve is an average of over 100 repeated random samples. The KLo estimate rapidly asymptotes with increasing sample size, unlike the KL and KSG estimates, which continue to increase even as sample size approaches the maximum available determined by image size. The Gaussian component of the KLo estimator was about 8.0 bits.
The second example is described in Appendix C in the Supplementary Material, available on the Cambridge Core website. It illustrates the similarity of the KLo and KL estimates and the failure of the KSG estimate with a univariate dataset.
Discussion
The slow convergence of mutual information estimators with increasing sample size is not inevitable. The offset method can clearly improve the convergence of estimates derived with the Kozachenko–Leonenko estimator with some real-world datasets. But the extent of the improvement does depend on the size of the Gaussian component of the underlying differential entropies. For distributions very far from Gaussian, there is no guarantee that the offset method will converge faster than applying the KL estimator directly. The offset method does, though, have the advantage of automatically adjusting itself to the properties of the distributions. It is, moreover, neutral with respect to the choice of differential entropy estimator, so that any other estimator can instead be plugged in.
The present approach may be open to generalization. One possibility is to replace the particular linear transformation used to decompose differential entropy into Gaussian and non-Gaussian components by other transformations. Another possibility is to extend the offset method to estimating related information-theoretic quantities such as Kullback–Liebler divergence and cross-entropy.
Acknowledgment
We are grateful to P. A. Gaydecki for providing the soundtrack samples and to S. M. C. Nascimento and K. Amano for collaborating in producing the hyperspectral images for the color data.
Supplementary Materials
To view supplementary material for this article, please visit http://doi.org/10.1017/exp.2022.14.
Data availability statement
The code is copyrighted by I.M.-F. and D.H.F. and distributed under Apache License v2.0. Both the code and data used in this manuscript are available for R, Python, and MATLAB computing environments at https://github.com/imarinfr/klo.
Funding statement
This work was supported by Computational Optometry (Atarfe, Spain; https://www.optocom.es/). The code was developed as part of two EPSRC grants (EP/B000257/1 and EP/E056512/1).
Conflicts of interest
The authors declare no conflicts of interest.
Authorship contributions
Conceptualization: I.M.-F. and D.H.F.; Methodology: I.M.-F.; Software: I.M.-F. and M.S.-S.; Validation: all authors; Visualization: D.H.F.; Writing—original draft: I.M.-F. and D.H.F.; Writing—review and editing: I.M.-F. and D.H.F.
 
  
 
 
 
 
 
 
 
 
 
 
 
 

 
              
Comments
Comments to the Author: Review of "Offset estimator of differential entropy and mutual information with multivariate data"
Authors considered an estimators already published in Marin-French & Foster (2013), which itself is based on Kozachenko-Leonenko (1987) estimator. From 2013 until today, I think that the estimator has been proved in several experiments. Thus, what is the real contribution of the paper? it is the public programming codes (Matlab-R-Python)? or the application?
Decomposition on a Gaussian and non-Gaussian component is an important step of the proposed method, which has widely considered in the literature for differential entropy and mutual information. From the references added in the manuscript, there exist(s) some(s) of them that considered this issue?
About the results, in Appendix A (of supplementary material) is obtained the limits for KL and KLo, where for a large sample size, they converge to differential entropy H. However, in Fig. 1, why not occurs the same for KL and KLo in log_2(n) ~ 19? Also, why is the intention of authors in including these both images? what is the difference among the images?