Maximum entropy from Bayes’ theorem

The principle of maximum entropy asserts that when trying to determine an unknown probability distribution (for example, the distribution of possible results that occur when you toss a possibly unfair die), you should pick the distribution with maximum entropy consistent with your knowledge.

The goal of this post is to derive the principle of maximum entropy in the special case of probability distributions over finite sets from

  • Bayes’ theorem and
  • the principle of indifference: assign probability $\frac{1}{n}$ to each of $n$ possible outcomes if you have no additional knowledge. (The slogan in statistical mechanics is “all microstates are equally likely.”)

We’ll do this by deriving an arguably more fundamental principle of maximum relative entropy using only Bayes’ theorem.

A better way to state Bayes’ theorem

Suppose you have a set $I$ of hypotheses about something, exactly one of which can be true, and some prior probabilities $\mathbb{P}(i), i \in I$ that these hypotheses are true (which therefore sum to $1$). Then you see some evidence $E$. (Here is a simultaneous definition of both hypotheses and evidence: hypotheses are things that assert how likely or unlikely evidence is. That is, what it means to give evidence about some hypotheses is that there ought to be some conditional probabilities $\mathbb{P}(E | i)$, the likelihoods, describing how likely it is that you see evidence $E$ conditional on hypothesis $i$.)

Bayes’ theorem in this setting is then usually stated as follows: you should now have updated posterior probabilities $\mathbb{P}(i | E)$ that your hypotheses are true conditional on your evidence, and they should be given by

$$\mathbb{P}(i | E) = \frac{\mathbb{P}(i) \mathbb{P}(E | i)}{\mathbb{P}(E)}.$$

That is, each prior probability $\mathbb{P}(i)$ gets multiplied by $\frac{\mathbb{P}(E | i)}{\mathbb{P}(E)}$, which describes how much more likely $i$ thinks the evidence $E$ is than before. You might be concerned that $\mathbb{P}(E)$ requires the introduction of extra information, but in fact it must be given by

$$\mathbb{P}(E) = \sum_i \mathbb{P}(i) \mathbb{P}(E | i)$$

by conditioning on each $i$ in turn, so it’s already determined by the priors and the likelihoods. (This is if the $i$ are parameterized by a discrete parameter; in general this sum should be replaced by an integral.)

In practice this statement of Bayes’ theorem seems to be annoyingly easy to forget, at least for me. Here is a better statement. The idea is to think of $\mathbb{P}(E)$ as just a normalization constant. Hence the revised statement is

$$\mathbb{P}(i | E) \propto \mathbb{P}(i) \mathbb{P}(E | i).$$

That is, the posterior probability is proportional to the prior probability times the likelihood, where the proportionality constant is uniquely determined by the requirement that the probabilities sum to $1$.

Intuitively: after seeing some evidence, your confidence in a hypothesis gets multiplied by how well the hypothesis predicted the evidence, then normalized. Now you can take your posteriors to be your new priors in preparation for seeing some more evidence. This is a Bayesian update.
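
To make the proportional form concrete, here is a minimal sketch of a single Bayesian update in Python; the three coin-bias hypotheses and the single observed flip are made up for illustration.

```python
# A minimal sketch of a Bayesian update: posterior proportional to prior times
# likelihood, then normalized.  Hypothetical setup: three hypotheses about a
# coin's bias toward heads.
biases = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]           # principle of indifference

def update(prior, likelihoods):
    """Multiply each prior by its likelihood, then normalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)           # this is P(E), the normalization constant
    return [u / total for u in unnormalized]

# Evidence: one flip comes up heads, so P(E | bias) = bias.
posterior = update(prior, biases)
print(posterior)                        # [0.2, 0.333..., 0.466...]
```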

Aside: measures up to scale and improper priors

This statement of Bayes’ theorem suggests a slight reformulation of what we mean by a probability measure: a probability measure is the same thing as a measure with nonzero total measure, up to scaling by positive reals. One reason to like this description is that it naturally incorporates improper priors, which correspond to prior probabilities Image may be NSFW.
Clik here to view.
\mathbb{P}(i)
 with possibly infinite total measure, up to scaling by positive reals. The point is that after a Bayesian update an improper prior may become proper again. For example, there’s an improper prior assigning measure Image may be NSFW.
Clik here to view.
1
to every positive integer Image may be NSFW.
Clik here to view.
n
, which allows us to talk about hypotheses Image may be NSFW.
Clik here to view.
n \in \mathbb{N}
indexed by the positive integers and with a prior which makes all of them equally likely.
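
Here is a small numerical sketch of that phenomenon. The hypotheses (“the die has $n$ sides”) and the evidence (two rolls both coming up 3, so the likelihood is $1/n^2$ for $n \ge 3$ and $0$ otherwise) are invented purely for illustration.

```python
# Sketch: an improper prior (measure 1 on every positive integer n) can become
# proper after an update.  Hypothesis n: "the die has n sides"; made-up evidence:
# two rolls both came up 3, so P(E | n) = 1/n^2 for n >= 3 and 0 otherwise.
N_MAX = 10_000                          # truncate the infinite sum for a numerical check

weights = {n: (1.0 / n**2 if n >= 3 else 0.0) for n in range(1, N_MAX + 1)}
total = sum(weights.values())           # finite, so the posterior is proper
posterior = {n: w / total for n, w in weights.items()}

print(total)                            # about 0.3949 (pi^2/6 - 1 - 1/4)
print(posterior[3], posterior[4])       # most of the mass sits on the smallest consistent n
```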

Improper priors may seem obviously bad because they don’t assign probabilities to things: in order to assign a probability you need to normalize by the total measure, which is infinite. However, with an improper prior it is still meaningful to make comparisons between probabilities: you can still meaningfully say that $\mathbb{P}(i)$ is larger than $\mathbb{P}(j)$, or exactly $r$ times $\mathbb{P}(j)$, since this comparison is invariant under scaling by positive reals.

There’s a somewhat philosophical argument that when performing Bayesian reasoning, only comparisons between probabilities are meaningful anyway: in order to know the probability, in the absolute sense, of something, you need to be absolutely sure you’ve written down every possible hypothesis $i$ (in order to ensure that exactly one of them is true). If you leave out the true hypothesis, then you might end up being more and more sure of an arbitrarily bad hypothesis because the true hypothesis wasn’t included in your calculations. In other words, computing the normalization constant $\mathbb{P}(E)$ in the usual statement of Bayes’ theorem is “global” in that it requires information about all of the $i$, but computing $\mathbb{P}(i) \mathbb{P}(E | i)$ is “local” in that it only involves one $i$ at a time.

(And it’s not enough just to have a few hypotheses and then a catch-all hypothesis called “everything else,” because “everything else” is not a hypothesis in the sense that it does not assign likelihoods $\mathbb{P}(E | i)$. A hypothesis has to make predictions.)

The setup

Back to maximum entropy. Imagine that you are repeatedly rolling an $n$-sided die, and you don’t know what the various weights on the die are: that is, you don’t know the true probabilities $p_k$ that the $k^{th}$ face of the die will come up.

However, you have some hypotheses about these probabilities. Your hypotheses $t \in T$ are parameterized by a parameter $t$, which for the sake of concreteness we’ll take to be a real number or a tuple of real numbers, but which could in principle be anything. Your hypotheses assign probability $p_k(t)$ to the $k^{th}$ face coming up. You also have some prior over your hypotheses, which we’ll write as a probability density function $f(t)$. Hence $f(t) \ge 0$ and

$$\int f(t) \, dt = 1$$

while the $p_k(t)$ are normalized so that

$$\sum_k p_k(t) = 1.$$

Example. If $n = 2$, we might imagine that we’re flipping a coin, with $p_0$ the probability that we flip heads and $p_1$ the probability that we flip tails. Our hypotheses might take the form $p_0(t) = t, p_1(t) = 1 - t$ where $t \in [0, 1]$, and our prior might be the uniform prior: each $t$ is equally likely. Hence our probability density is $f(t) = 1$.

Now suppose you roll the die $N$ times. What happens to your beliefs under Bayesian updating in the limit $N \to \infty$?

The principle of maximum relative entropy

Suppose you see the $k^{th}$ face come up $N q_k$ times (so the $q_k$ are nonnegative and $\sum_k q_k = 1$; they describe the observed relative frequencies of the various faces coming up, and altogether describe the empirical probability distribution). Hypothesis $t$ predicts that this happens with probability

$$\mathbb{P}(q | t) = \frac{N!}{\prod_k (N q_k)!} \prod_k p_k(t)^{N q_k}.$$

Let’s see how this function behaves as $N \to \infty$. Taking the log, and using Stirling’s approximation in the form

$$\log N! \approx N \log N - N$$

we get

$$\log \frac{N!}{\prod_k (N q_k)!} \approx \left( N \log N - N \right) - \sum_k \left( N q_k \log (N q_k) - N q_k \right).$$

Various terms cancel here due to the fact that $\sum_k q_k = 1$. At the end of the day we get

$$\log \frac{N!}{\prod_k (N q_k)!} \approx - N \left( \sum_k q_k \log q_k \right) = N H(q).$$

This is the first appearance of the function $H(q) = - \sum_k q_k \log q_k$, the entropy of $q$, regarded as a probability distribution over faces. This is perhaps the most concrete and least mysterious way of introducing entropy: it’s a concise way of summarizing the asymptotic behavior of the multinomial distribution as $N \to \infty$. Already we see that the entropy being larger corresponds to the counts $N q_k$ being more likely, in a very serious way.
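
As a quick sanity check on this asymptotic, here is a sketch comparing the exact logarithm of the multinomial coefficient with $N H(q)$ for a made-up frequency vector $q$; the ratio tends to $1$ as $N$ grows (the correction is only of size $O(\log N)$).

```python
import math

def log_multinomial(counts):
    """Exact log of N! / prod_k (counts_k!) via lgamma."""
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(q):
    return -sum(x * math.log(x) for x in q if x > 0)

q = [0.5, 0.3, 0.2]                     # made-up empirical frequencies
for N in (100, 1_000, 10_000, 100_000):
    counts = [round(N * x) for x in q]
    print(N, log_multinomial(counts) / (N * entropy(q)))   # ratio approaches 1
```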

But there’s a second term in the likelihood, so let’s compute the logarithm of that too. This gives

$$\log \prod_k p_k(t)^{N q_k} = N \sum_k q_k \log p_k(t).$$

Thus the logarithm of the likelihood, or log-likelihood, is

$$\log \mathbb{P}(q | t) \approx N \left( \sum_k q_k \log \frac{p_k(t)}{q_k} \right).$$

Now the function that appears is the negative of the Kullback-Leibler divergence. We’ll call it the relative entropy (although this term is sometimes used for the KL divergence, not its negative) and denote it somewhat arbitrarily by $H(p(t) \to q)$.

Altogether, the posterior density is now proportional to

$$f(t | q) \propto f(t) \exp \left( N H(p(t) \to q) \right).$$

From here it’s not hard to see that the posterior density is overwhelmingly concentrated at the hypotheses $t$ that maximize the relative entropy $H(p(t) \to q)$ as $N \to \infty$, subject to the constraint that the prior density $f(t)$ is positive. This is because all the other posterior densities are exponentially smaller in comparison, and as long as the prior density $f(t)$ is positive, its exact value doesn’t matter: it is negligible in comparison to the main exponential term.

This calculation suggests that we can interpret the relative entropy $H(p(t) \to q)$ as a measure of how well the hypothesis $p(t)$ fits the evidence $q$: the larger this number is, the better the fit. (A more common way to describe relative entropy is as a measure of how well a hypothesis fits the “truth.” Here our model for being told that $q$ is the “truth” is seeing it asymptotically as $N \to \infty$.)

Let’s wrap that conclusion up into a theorem.

Theorem: With hypotheses as above, as $N \to \infty$, Bayesian updates converge towards believing the hypothesis $t$ that maximizes the relative entropy $H(p(t) \to q)$, subject to the constraint that the prior density $f(t)$ is positive.
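
Here is a small numerical sketch of the theorem for a coin; the numbers are made up. The hypothesis grid deliberately excludes the observed frequency, so the posterior piles up on the allowed hypothesis of maximal relative entropy (the endpoint $t = 0.5$).

```python
import math

def rel_entropy(p, q):
    """H(p -> q) = sum_k q_k log(p_k / q_k)."""
    return sum(qk * math.log(pk / qk) for pk, qk in zip(p, q) if qk > 0)

ts = [i / 100 for i in range(1, 51)]    # hypotheses: coin bias t in (0, 0.5]
q = (0.7, 0.3)                          # observed frequencies (made up)

def posterior(N):
    """Posterior over the grid after N flips with frequencies q, uniform prior."""
    log_w = [N * rel_entropy((t, 1 - t), q) for t in ts]
    m = max(log_w)                      # subtract the max for numerical stability
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

for N in (10, 100, 1000):
    post = posterior(N)
    best = ts[post.index(max(post))]
    print(N, best, max(post))           # the mass piles up at t = 0.5 as N grows
```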

Now, suppose the true probabilities are $p_k$. Then as $N \to \infty$ we expect, by the law of large numbers, that the observed frequencies $q_k$ approach the true probabilities $p_k$. If the true probabilities are among our hypotheses $p_k(t)$, we would hope, and it seems intuitively clear, that we’ll converge towards believing the true hypothesis. This requires showing the following.

Theorem: The relative entropy $H(p \to q)$ is nonpositive, and for fixed $q$, it takes its maximum value $0$ iff $p = q$.

Proof. This is more or less a computation with Lagrange multipliers. For fixed $q$, we want to maximize

$$H(p \to q) = \sum_k q_k \log \frac{p_k}{q_k}$$

subject to the constraint $\sum_k p_k = 1$. This constraint means that at a critical point of $H$ (whether a maximum, a minimum, or a saddle point), all of the partial derivatives $\frac{\partial H}{\partial p_k}$ should be equal. (Intuitively, we have a “budget” of probability to spend to increase $H$, and as we spend more probability on one $p_k$ we necessarily must spend less probability on the others. The critical points are then the points where we can’t do any better by shifting our probability budget, meaning that the marginal value of each probability increase is equally good.)

We compute that

$$\frac{\partial H}{\partial p_k} = \frac{q_k}{p_k}$$

so setting all partial derivatives equal we conclude that $p_k$ must be proportional to $q_k$, and the additional constraint $\sum_k p_k = 1$ gives $p_k = q_k$ for all $k$.

At this critical point $H$ takes the value $H(p \to p) = 0$. Now we need to show that this critical point is a maximum and not a minimum. Since it’s the unique critical point, it suffices to show that it’s a local maximum. So, consider a point $p_k = q_k + \epsilon_k$ in a small neighborhood of this critical point, where $\sum_k \epsilon_k = 0$. To second order, we have

$$\log \frac{q_k + \epsilon_k}{q_k} = \frac{\epsilon_k}{q_k} - \frac{\epsilon_k^2}{2q_k^2} + O \left( \frac{\epsilon_k^3}{q_k^3} \right)$$

and hence

$$H(p \to q) = \sum_k \left( \epsilon_k - \frac{\epsilon_k^2}{2q_k} \right) + O \left( \sum_k \left( \frac{\epsilon_k}{q_k} \right)^3 \right).$$

The linear term $\sum_k \epsilon_k$ vanishes, and the quadratic term is negative definite as desired. (Strictly speaking we need to require that the $q_k$ are all positive, but if any of them happen to be zero then the corresponding value of $k$ can be safely ignored anyway, since it won’t figure in any of our computations.) $\Box$
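
A tiny numerical spot check of this theorem, using arbitrary randomly chosen points of the simplex:

```python
import math, random

def rel_entropy(p, q):
    return sum(qk * math.log(pk / qk) for pk, qk in zip(p, q) if qk > 0)

def random_simplex(n):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

random.seed(0)
for _ in range(1000):
    p, q = random_simplex(4), random_simplex(4)
    assert rel_entropy(p, q) <= 0            # always nonpositive
    assert abs(rel_entropy(q, q)) < 1e-12    # and zero when the arguments agree
print("all checks passed")
```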

Corollary: With hypotheses as above, as $N \to \infty$, if the true hypothesis is among the hypotheses with positive density, Bayesian updates converge towards believing it. False hypotheses are disbelieved at an exponential rate, with base the exponential of the relative entropy.

In other words, as $N \to \infty$, the Bayesian definition of probability converges to the frequentist definition of probability.

Example. Let’s return to the $n = 2$ example, where we’re flipping a coin with an unknown bias $t \in [0, 1]$, so that $p_0(t) = t, p_1(t) = 1 - t$ are the probabilities of flipping heads and tails respectively given bias $t$, and our prior is uniform. Suppose that after $N$ trials we observe $N s$ heads and $N (1 - s)$ tails, where $s \in [0, 1]$. Then

$$f_{\text{post}}(t) \propto t^{Ns} (1 - t)^{N(1-s)}.$$

(We can drop the binomial coefficient because it’s the same for all values of $t$ and so can be absorbed into our proportionality constant. We introduced it into the above computation because it becomes important later.)

This computation can be used to deduce the rule of succession, which asserts that at this point you should assign probability $\frac{Ns + 1}{N + 2}$ to heads coming up on the next coin flip. Note that as $N \to \infty$ this converges to $s$.
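
A quick numerical check of this, with hypothetical counts: integrate the posterior $t^{Ns}(1-t)^{N(1-s)}$ on a grid and compare its mean, which is the predictive probability of heads, against $\frac{Ns+1}{N+2}$.

```python
N, heads = 10, 7                        # hypothetical data: 7 heads in 10 flips
tails = N - heads

grid = [(i + 0.5) / 10_000 for i in range(10_000)]       # midpoint rule on [0, 1]
weights = [t**heads * (1 - t)**tails for t in grid]
predictive = sum(t * w for t, w in zip(grid, weights)) / sum(weights)

print(predictive)                       # about 0.6667
print((heads + 1) / (N + 2))            # (Ns + 1)/(N + 2) = 8/12, also about 0.6667
```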

The posterior density can be written as

$$f_{\text{post}}(t) \propto \exp \left( N (s \log t + (1 - s) \log (1 - t)) \right)$$

which takes its maximum value when $t = s$ by our results above, although in this case one-variable calculus suffices to prove this. Near this maximum value, Taylor expanding around $t = s$ shows that, for values of $t$ sufficiently close to $s$, the posterior density is approximately a Gaussian centered at $s$ with standard deviation $O \left( \frac{1}{\sqrt{N}} \right)$. Hence in order to be confident that the true bias lies in an interval of size $\epsilon$ with high probability we need to look at $O \left( \frac{1}{\epsilon^2} \right)$ coin flips.
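
And a sketch of the $O\left(\frac{1}{\sqrt{N}}\right)$ claim, again on a grid with a made-up frequency $s$: the posterior standard deviation tracks $\sqrt{s(1-s)/N}$, the width predicted by the Gaussian approximation.

```python
import math

s = 0.6                                 # made-up observed frequency of heads
for N in (100, 400, 1600):
    grid = [(i + 0.5) / 100_000 for i in range(100_000)]
    log_w = [N * (s * math.log(t) + (1 - s) * math.log(1 - t)) for t in grid]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    Z = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / Z
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / Z
    print(N, math.sqrt(var), math.sqrt(s * (1 - s) / N))  # the two widths agree closely
```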

Maximum entropy

We were supposed to get a criterion in terms of maximizing entropy, not relative entropy. What happened to that?

Now instead of knowing the relative frequencies $q_k$, let’s assume that we only know that they satisfy some conditions. For example, for any function $X : \{ 1, 2, \dots n \} \to V$, where $V$ is a finite-dimensional real vector space, we might know the expected value

$$\langle X \rangle = \sum_k X_k q_k \in V$$

of $X$ with respect to the empirical probability distribution. In statistical mechanics a typical and important example is that we might know the average energy. We also might be observing a random walk, and while we don’t know how many steps the random walker took in a given direction (perhaps because they’re moving too fast for us to see), we might know where they ended up after $N$ steps, which tells us the average of all the steps the walker took.

(Strictly speaking, if we’re talking about the empirical distribution, where in particular each $q_k$ is necessarily rational, it’s too much to ask that any particular condition be exactly satisfied. We’d be happy to see that it’s asymptotically satisfied as $N \to \infty$, from which we’re concluding that our conditions are exactly satisfied for the true probabilities $p_k$, or something like that. It seems there’s something subtle going on here and I am going to completely ignore it.)

Knowing the expected values of some functions is equivalent to knowing that the empirical distribution $q$ lies in some affine subspace of the probability simplex

$$\Delta^{n-1} = \{ q \in \mathbb{R}^n : q_k \ge 0, \sum_k q_k = 1 \}.$$

However, more complicated constraints are possible. For example, suppose that $n = n_1 n_2$ and that we’re really rolling two independent dice with $n_1$ and $n_2$ sides, respectively, so that we can relabel the possible outcomes with pairs

$$(k_1, k_2) \in \{ 1, 2, \dots n_1 \} \times \{ 1, 2, \dots n_2 \}.$$

Observing this is true means that, at least asymptotically as $N \to \infty$, we observe that we can write $q_{(k_1, k_2)} = r_{k_1} s_{k_2}$, where

$$r_{k_1} = \sum_{k_2} q_{(k_1, k_2)}$$

is the empirical probability that the first die comes up $k_1$, and similarly

$$s_{k_2} = \sum_{k_1} q_{(k_1, k_2)}$$

is the empirical probability that the second die comes up $k_2$. This is a nonlinear constraint: in fact it describes a collection of quadratic equations that the variables $q_{(k_1, k_2)}$ must satisfy. Imposing these equations turns out to be equivalent to imposing the simpler homogeneous quadratic equations

$$q_{k_1, k_2} q_{k_3, k_4} = q_{k_1, k_4} q_{k_3, k_2}$$

which we might recognize as the equations cutting out the image of the Segre embedding

$$\mathbb{P}^{n_1 - 1} \times \mathbb{P}^{n_2 - 1} \to \mathbb{P}^{n_1 n_2 - 1}.$$

The idea is to think of the probability simplex $\Delta^d$ as sitting inside projective space $\mathbb{P}^d$; then the restriction of the Segre embedding to probability simplices produces a map

$$\Delta^{n_1 - 1} \times \Delta^{n_2 - 1} \to \Delta^{n_1 n_2 - 1}$$

describing how a probability distribution $r_{k_1}$ over the first die and a probability distribution $s_{k_2}$ over the second die gives rise to a joint probability distribution $q_{(k_1, k_2)} = r_{k_1} s_{k_2}$ over both of them. More complicated variations of this example are considered in algebraic statistics.
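
As a small check, with arbitrary randomly chosen marginals, a product distribution does satisfy the quadratic relations above:

```python
import random

random.seed(1)
n1, n2 = 3, 4

def random_dist(n):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

r, s = random_dist(n1), random_dist(n2)
# The joint distribution of two independent dice: the image of (r, s) under the Segre map.
q = [[r[k1] * s[k2] for k2 in range(n2)] for k1 in range(n1)]

for k1 in range(n1):
    for k3 in range(n1):
        for k2 in range(n2):
            for k4 in range(n2):
                assert abs(q[k1][k2] * q[k3][k4] - q[k1][k4] * q[k3][k2]) < 1e-15
print("all quadratic relations hold")
```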

In any case, the game is that instead of knowing the empirical distribution we now only know some conditions it satisfies. Write the set of all distributions satisfying these conditions as $E$. What happens as $N \to \infty$? Hypothesis $t$ still predicts that empirical distribution $q$ occurs with probability

$$\mathbb{P}(q | t) = \frac{N!}{\prod_k (N q_k)!} \prod_k p_k(t)^{N q_k}$$

and hence it predicts that we observe that our conditions are satisfied with probability

$$\mathbb{P}(E | t) = \sum_{q \in E} \frac{N!}{\prod_k (N q_k)!} \prod_k p_k(t)^{N q_k}$$

(where $E$ is shorthand for the event $q \in E$). Using our previous approximations, we can rewrite this as

$$\mathbb{P}(E | t) \approx \sum_{q \in E} \exp \left( N H(p(t) \to q) \right)$$

which gives posterior densities

$$f_{\text{post}}(t) \propto f(t) \sum_{q \in E} \exp \left( N H(p(t) \to q) \right).$$

As before, we find that the posterior densities are overwhelmingly concentrated at the hypotheses $p(t)$ that maximize the relative entropy $H(p(t) \to q)$ as $N \to \infty$ (again subject to the constraint that $f(t) > 0$), but where $q$ is now allowed to run over all $q \in E$.

If our prior assigns nonzero density to every possible probability distribution in the probability simplex (for example, we could take $t$ to be parameterized by the points of the probability simplex, $p(t)$ to be the probability distribution corresponding to the point $t$, and $f(t)$ to be a constant, suitably normalized), then we know that relative entropy takes its maximum value $0$ when its two arguments are equal, so we can restrict our attention to the case that $p(t) = q$ above. We find that, asymptotically as $N \to \infty$, the posterior density is proportional to the prior density $f(t)$ as long as $p(t)$ satisfies the conditions, and is $0$ otherwise.

This is unsurprising: we assumed that all we were told about the empirical distribution is that it satisfied some conditions, so the only change we make to our prior is that we condition on that.

We still haven’t gotten a characterization in terms of entropy, as opposed to relative entropy. This is where we are going to invoke the principle of indifference, which in this situation asserts that the prior we should have is the one concentrated entirely at the hypothesis

$$p_k = \frac{1}{n}$$

that the die rolls are being generated uniformly at random. Note that this means the posterior is also concentrated entirely at this hypothesis!

We now predict empirical probability distribution $q$ with probability distributed according to the multinomial distribution, namely

$$\mathbb{P}(q) = \frac{1}{n^N} \frac{N!}{\prod_k (N q_k)!} \approx \frac{1}{n^N} \exp \left( N H(q) \right)$$

where $\frac{1}{n^N}$ is now a normalization constant and can be ignored, and $H(q) = - \sum_k q_k \log q_k$ is the entropy, rather than the relative entropy. This comes from substituting the uniform distribution for $p$ into the relative entropy $H(p \to q)$.

We now want to ask a slightly different question than before. Before, we were asking what our beliefs were about the underlying “true” probabilities generating the die rolls. Now we’ve already fixed those beliefs, and we’re instead going to ask what our beliefs are about the empirical distribution $q$, which we now no longer know, conditioned on the fact that $q \in E$. By Bayes’ theorem, this is

$$\mathbb{P}(q | E) \propto \mathbb{P}(q) \mathbb{P}(E | q)$$

where $\mathbb{P}(E | q)$ is either $1$ if $q \in E$ or $0$ if $q \not\in E$, and $\mathbb{P}(q) \propto \exp \left( N H(q) \right)$ as before. Overall, we conclude the following.

Theorem: Starting from the indifference prior, as $N \to \infty$, Bayesian updates converge towards believing that the empirical distribution $q$ is the maximum entropy distribution in $E$.
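
Here is a brute-force sketch of the theorem for a 3-sided die under the indifference prior, with a made-up linear constraint: the empirical mean of the face values $1, 2, 3$ equals $2.5$. Among the empirical distributions satisfying the constraint, the most probable one approaches the maximum entropy distribution in $E$, which works out to roughly $(0.116, 0.268, 0.616)$, as $N$ grows.

```python
import math

faces = (1, 2, 3)
target_mean = 2.5                       # made-up constraint defining E

def log_prob(counts, N):
    """Log multinomial probability of these counts under the uniform die p = (1/3, 1/3, 1/3)."""
    return (math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
            - N * math.log(3))

for N in (40, 200, 1000):
    best = None
    for c1 in range(N + 1):
        for c2 in range(N + 1 - c1):
            c3 = N - c1 - c2
            counts = (c1, c2, c3)
            if sum(f * c for f, c in zip(faces, counts)) != target_mean * N:
                continue                # this q is not in E
            lp = log_prob(counts, N)
            if best is None or lp > best[0]:
                best = (lp, counts)
    print(N, [c / N for c in best[1]])  # tends to the maximum entropy point of E
```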

In some sense this is not at all a deep statement: it’s just the observation that entropy describes the asymptotics of the multinomial distribution, together with conditioning on $q \in E$. Although it is somewhat interesting that conditioning on $q \in E$, in this setup, is done by seeing that $q$ appears to asymptotically lie in $E$ as $N \to \infty$.

Edit: This is essentially the Wallis derivation, but with a much larger emphasis placed on the choice of prior.

Example. Suppose $E$ consists of all probability distributions such that the expected value

$$\langle X \rangle = \sum_k X_k q_k$$

of some random variable $X : \{ 1, 2, \dots n \} \to V$ (possibly vector-valued) is fixed; call this fixed value $X_0$. Then we want to maximize $H(q) = - \sum_k q_k \log q_k$ subject to this constraint and the constraint $\sum_k q_k = 1$. This is again a Lagrange multiplier problem. We’ll introduce a vector-valued Lagrange multiplier $\lambda \in V^{\ast}$, as well as a scalar Lagrange multiplier $r - 1$ for the constraint $\sum_k q_k = 1$ that will later disappear from the calculation. (The $-1$ will slightly simplify the calculation.)

Then the method of Lagrange multipliers says that any maximum must be a critical point of the function

$$L(q, \lambda) = \lambda \langle X \rangle + (r - 1) \left( \sum_k q_k \right) - H(q)$$

for some value of $\lambda$ and $r$. (Here we are hiding the dependence of $\langle X \rangle$ on $q$.) Using the fact that

$$\frac{d}{dx} \left( -x \log x \right) = - \log x - 1$$

we compute that

$$\frac{\partial L}{\partial q_k} = \lambda (X_k) + \log q_k + r$$

and setting these partial derivatives equal to $0$ gives

$$q_k = e^{- \lambda (X_k) - r} = \frac{1}{Z} e^{-\lambda (X_k)}$$

where $Z = e^r$. But since the $q_k$ must sum to $1$, $Z$ is a normalization constant determined by this condition, and in fact must be the partition function

$$Z = \sum_k e^{-\lambda (X_k)}.$$

From here we can compute that the expected value of $X$ is

$$\langle X \rangle = \frac{1}{Z} \sum_k X_k e^{-\lambda (X_k)} = - \frac{d \log Z}{d \lambda}$$

and the entropy is

$$H = \langle -\log q \rangle = \log Z + \frac{1}{Z} \sum_k \lambda (X_k) e^{-\lambda (X_k)} = \log Z + \lambda \langle X \rangle.$$
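
Here is a minimal sketch of the scalar case, with hypothetical numbers: the face values of an ordinary six-sided die and a constrained average of $4.5$. Since $\langle X \rangle$ is monotone decreasing in $\lambda$, we can solve for $\lambda$ by bisection and then check the identity $H = \log Z + \lambda \langle X \rangle$.

```python
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # face values of an ordinary die
X0 = 4.5                                # hypothetical constrained average

def stats(lam):
    """Partition function, distribution, and mean for q_k = exp(-lam * X_k) / Z."""
    weights = [math.exp(-lam * x) for x in X]
    Z = sum(weights)
    q = [w / Z for w in weights]
    return Z, q, sum(x * qk for x, qk in zip(X, q))

# <X> is decreasing in lambda, so bisect for the value matching X0.
lo, hi = -50.0, 50.0
for _ in range(100):
    mid = (lo + hi) / 2
    if stats(mid)[2] > X0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

Z, q, mean = stats(lam)
H = -sum(qk * math.log(qk) for qk in q)
print(lam, [round(qk, 4) for qk in q])  # the weights tilt toward the larger faces
print(mean, X0)                         # the constraint is satisfied
print(H, math.log(Z) + lam * mean)      # the two entropy expressions agree
```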

In statistical mechanics, a “die” is a statistical-mechanical system, and $X$ is a vector of variables such as energy and particle number describing that system. $\lambda$, the Lagrange multiplier, is a vector of conjugate variables such as (inverse) temperature and chemical potential. The probability distribution $q$ we’ve just described is the canonical ensemble if $X$ consists only of energy, and the grand canonical ensemble if $X$ consists of energy and particle numbers.

The uniform prior we assumed using the principle of indifference, possibly after conditioning on a fixed value of the energy (rather than a fixed expected value), is the microcanonical ensemble. The assumption that this is a reasonable prior is called the fundamental postulate of statistical mechanics. As Terence Tao explains here, in a suitable finite toy model involving Markov chains at equilibrium it can be proven rigorously, but in more complicated settings the fundamental postulate is harder to justify, and of course in some settings it will just be wrong. In these settings we can instead use the principle of maximum relative entropy.

