# Entropy in Statistical Inference and Prediction

Original by Chris Hillman (Last modified by Chris Hillman 2 Feb 2001.)

Ronald A. Fisher introduced a notion of information into the theory of statistical inference in 1925. This Fisher information is now understood to be closely related to Shannon's notion of entropy. In 1951, Solomon Kullback introduced the divergence between two probability distributions (this quantity is also called the relative entropy, discrimination, Kullback-Leibler entropy, etc.) and found another connection between Shannon's entropy and the theory of statistical inference.
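
For concreteness, the Kullback-Leibler divergence between discrete distributions p and q is D(p‖q) = Σ p_i log(p_i / q_i). A minimal sketch in Python (the two distributions here are made up purely for illustration):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions, given as sequences of probabilities (in bits)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # a fair coin
q = [0.9, 0.1]   # a heavily biased coin
print(kl_divergence(p, q))   # cost (in bits/symbol) of assuming q when p is true
print(kl_divergence(q, p))   # note: the divergence is not symmetric
```

Note that D(p‖q) is zero exactly when p = q, and is not symmetric in its arguments, which is why "divergence" rather than "distance" is the preferred term.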

For a modern development along these lines, see the following papers:

• Prediction and Information Theory, by John Kieffer (Electrical Engineering, University of Minnesota), an expository paper offering a good short introduction to the intimate connection between these two topics.
• "Universal Prediction of Individual Sequences", by Meir Feder, Neri Merhav and Michael Gutman, IEEE Trans. Information Theory 38 (1992): 1258--1270, won the 1994 Information Theory Society Prize. In a second, more informal paper, Reflections on "Universal Prediction of Individual Sequences", the three authors describe how they came to write the first. Briefly, their starting point was Shannon's own intuition that the entropy of an information source measures how well its behavior (e.g. the next symbol in a sequence it produces) can be predicted. Combined with the existence of universal encoders (e.g. the Lempel-Ziv algorithm; see the talk by Wyner listed above), this suggests that there should be an algorithm for predicting the next symbol in a sequence which is guaranteed to become as accurate as desired, for any information source, provided you are willing to wait long enough (for the algorithm to "train itself", if you will).
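
The predictors analyzed by Feder, Merhav and Gutman are finite-state and Lempel-Ziv based schemes, which are well beyond a short sketch. But the basic idea of a sequential predictor that "trains itself" on the sequence seen so far can be illustrated with something much simpler, namely Laplace's rule of succession (the biased source below is invented for the demonstration):

```python
import random

def laplace_predict(history, symbols=(0, 1)):
    """Predict the next symbol from counts over the history so far,
    using add-one (Laplace) smoothing to handle the empty history."""
    counts = {s: 1 for s in symbols}       # add-one prior
    for s in history:
        counts[s] += 1
    return max(symbols, key=lambda s: counts[s])

# On a memoryless biased source, the predictor's hit rate approaches
# the frequency of the majority symbol as it sees more of the sequence.
random.seed(0)
seq = [1 if random.random() < 0.8 else 0 for _ in range(2000)]
hits = sum(laplace_predict(seq[:t]) == seq[t] for t in range(len(seq)))
print(hits / len(seq))
```

This toy predictor only exploits symbol frequencies; the point of the universal schemes in the paper is that they achieve this kind of asymptotic guarantee for *individual* sequences, without any statistical assumptions on the source.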

In 1957, Edwin Jaynes introduced the fundamental Principle of Maximal Entropy. A little later, this was subsumed by the more general Principle of Minimal Divergence. Over the last several decades, Jaynes and his followers have attempted to develop a Bayesian theory of probability as a "degree of belief", based upon the Principle of Maximal Entropy. This program remains highly controversial among probabilists; the philosophical issues involved are thorny and subtle.
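
The classic illustration is Jaynes's Brandeis dice problem: given only that a die's average roll is 4.5 rather than the fair value 3.5, the Principle of Maximal Entropy selects, among all distributions with that mean, the one of maximal entropy, which takes the exponential form p_k ∝ exp(λk). A sketch that solves for the multiplier λ by bisection (the target mean 4.5 is Jaynes's example; the numerical method is just one convenient choice):

```python
import math

def maxent_die(target_mean, faces=range(1, 7), tol=1e-10):
    """Maximum-entropy distribution over die faces subject to a mean
    constraint.  The solution has the form p_k proportional to
    exp(lam * k); solve for the Lagrange multiplier lam by bisection,
    using the fact that the constrained mean is increasing in lam."""
    def mean_for(lam):
        w = [math.exp(lam * k) for k in faces]
        return sum(k * wk for k, wk in zip(faces, w)) / sum(w)
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * k) for k in faces]
    z = sum(w)
    return [wk / z for wk in w]

p = maxent_die(4.5)
print([round(x, 4) for x in p])   # probabilities rise gently toward face 6
```

With target mean 3.5 the same code returns the uniform distribution, which is the maximum-entropy answer when the constraint adds no information.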

Here are some places you can learn more about the Principle of Maximal Entropy and its many applications:

• Maximal Entropy and Bayesian Probability Theory, a collection of expository sketches and tutorials by someone in the CEMS research group in the Chemical Science and Technology Division, Los Alamos National Laboratory. See especially the excellent tutorial on Inverse Problems and Surprisal Analysis in Physics.
• Probability Theory: The Logic of Science. The substantially complete draft of Edwin Jaynes's enormous book on probability theory as a science of plausible inference. If you have the slightest interest in probability, the philosophy of mathematics or information theory, you should take a look at this fascinating and provocative book! See particularly Chapter 11 (Entropy Principle), Chapter 27 (Communication Theory) and Chapter 29 (Statistical Mechanics). Written for undergraduates, but some chapters are fairly demanding. (The book is downloadable chapter by chapter as postscript files, with illustrations included.)

Recently a theory of statistical manifolds has been developed, in which entropy appears as a geometric quantity related to curvature. Some idea of how this works can be gained from the following expository paper:

• From Euclid to Entropy by Carlos Rodriguez (Statistics, SUNY Albany). Did you know that statistical inference and discrimination are related to the cross-ratio studied in projective geometry? I didn't! In this paper, Rodriguez explains at the undergraduate level why it is reasonable to expect at least a "spiritual relation".

Meanwhile, a whole literature on the important problem of estimating entropies from noisy data has arisen. The papers of David Wolf (Physics, University of Texas at Austin) discuss Bayesian estimators of various entropies.
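
Wolf's Bayesian estimators are too involved for a short sketch, but the problem they address is easy to demonstrate: the naive "plug-in" estimate (entropy of the empirical frequencies) is biased downward for finite samples. The following toy example shows the plug-in estimator alongside the classical Miller-Madow bias correction (the uniform source and sample size are made up for illustration):

```python
import math
import random

def plugin_entropy(samples):
    """Naive plug-in entropy estimate (in bits): the Shannon entropy
    of the empirical frequency distribution."""
    n = len(samples)
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def miller_madow(samples, alphabet_size):
    """Plug-in estimate plus the Miller-Madow correction
    (K - 1) / (2 N ln 2), which offsets the downward bias."""
    n = len(samples)
    return plugin_entropy(samples) + (alphabet_size - 1) / (2 * n * math.log(2))

random.seed(1)
k = 8                                       # uniform source: true entropy is 3 bits
samples = [random.randrange(k) for _ in range(100)]
print(plugin_entropy(samples))              # at most 3, and typically below it
print(miller_madow(samples, k))             # closer to 3 on average
```

The correction only removes the leading bias term; for small samples or large alphabets the problem is genuinely hard, which is what motivates the Bayesian approach.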