\documentclass[12pt,a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern,exscale}
%\usepackage{newtxtext}
\usepackage[scaled=.95]{newpxtext}
\usepackage{textcomp}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{bm}
\usepackage[all]{xy}
\usepackage{microtype}
\usepackage[svgnames]{xcolor}
\usepackage[margin=2.5cm]{geometry}
\usepackage{sectsty}
\allsectionsfont{\sffamily\color{Red}}
\paragraphfont{\sffamily}
\subparagraphfont{\sffamily}
\usepackage[font={sf,small},labelfont={bf,color=Red},labelsep=quad]{caption}
\usepackage{graphicx}
\graphicspath{{./fig/}}
\usepackage{booktabs}
\usepackage{enumitem}
\usepackage{natbib}
\usepackage{hyperxmp}
\usepackage[
bookmarks=true,bookmarksopen=true,
backref=page,colorlinks,allcolors=MidnightBlue,
]{hyperref}
\linespread{1.05}
%\renewcommand{\familydefault}{\sfdefault}
\newcommand{\email}[1]{\href{mailto:#1}{\texttt{#1}}}
\newcommand{\twitter}[1]{\href{https://twitter.com/#1}{\texttt{@#1}}}
\newcommand{\github}[1]{\href{https://github.com/#1}{\texttt{@#1}}}
\newcommand{\www}{\colorbox{MediumBlue}{\textcolor{White}{\textbf{\textsf{www}}}}}
\newcommand{\Section}[1]{%
\section*{#1}
\addcontentsline{toc}{section}{#1}}
\newcommand{\Subsection}[1]{%
\subsection*{#1}
\addcontentsline{toc}{subsection}{#1}}
\newcommand{\erratum}[1]{%
\subsubsection*{#1}
\addcontentsline{toc}{subsection}{#1}}
% TODO:
% Split lengthy paragraphs into shorter ones and use paragraph headers where appropriate to help
% the reader better understand the organization of the text.
\newcommand{\parhead}[1]{%
\paragraph{#1}
\addcontentsline{toc}{subsubsection}{#1}}
\title{More PRML Errata}
\author{Yousuke Takada (\github{yousuketakada}) \\ \email{yousuketakada@gmail.com}}
\date{\today}
\hypersetup{
pdftitle = {More PRML Errata},
pdfauthor = {Yousuke Takada},
pdfsubject = {Some more errata and additional comments for PRML},
pdfkeywords = {PRML, errata, Bishop, machine learning},
}
\begin{document}
\maketitle
\Section{Preface}
This report is an unofficial list of errata for \emph{Pattern Recognition and Machine Learning} or
PRML by \citet*{Bishop:PRML}.
In this report, I have compiled only errata
that are not (yet) listed in the official errata document~\citep{Svensen:PRML_errata}
at the time of this writing.
Currently, there are three versions of the official errata document,
corresponding to the first (2006), second (2007), and third (2009) printings of PRML;
consult the
\href{https://www.microsoft.com/en-us/research/people/cmbishop/\#!prml-book}{support page}
for how to identify which printing your copy of PRML is from.\footnote{%
The last line but one of the bibliographic information page
(the page immediately preceding the dedication page) of my copy of PRML
reads ``9 8 7 (corrected at 6th printing 2007).''
Note that, although it says it is from the ``6th printing,''
it is actually from the \emph{second} printing according to
the official errata document~\citep{Svensen:PRML_errata};
I therefore refer to Version~2 of the official errata.}
I have tried to follow the terminology and the notation used in PRML and the official errata
as closely as possible.
In particular, when specifying the location of an error,
I follow the notational conventions (such as ``Paragraph~2, Line~\textminus1'')
adopted by \citet{Svensen:PRML_errata}.
As the official errata document ``is intended to be complete,''
this report also tries to correct even trivial typographical errors.
PRML is arguably one of the greatest textbooks in the field of machine learning;
it is extremely helpful and easier to understand than any other comparable account.
That said, there are a few subtleties
that some readers might have a hard time appreciating.
In the hope of helping such readers out of their struggle or
of giving them a better grasp of some important concepts,
I have also included in this report some comments and suggestions for improving readability
that I would have liked to have had when I first read PRML.
It should be noted that the readers of the Japanese edition of PRML will find its
\href{http://ibisforest.org/index.php?PRML}{support page} (in Japanese) useful.
Along with other information such as the contents,
it lists errata specific to the Japanese edition as well as
some additional errata for the English edition,
which have also been included in this report for the reader's convenience.
I welcome all comments and suggestions regarding this report;
please send me any such feedback via email or, preferably,
by creating an ``issue'' or a ``pull request'' at the following GitHub repository
\begin{center}
\url{https://github.com/yousuketakada/prml\_errata}
\end{center}
where you can find the source code of this report
as well as other supporting material.
\Subsection{Acknowledgements}
I would like to thank those who have helped improve this report
by informing me of yet more errata and clarifications
(for PRML and also for this report itself).
In particular, I am grateful to
Christopher Sahnwaldt (\github{jcsahnwaldt}),
Mark-Jan Nederhof,
David S.\ Rosenberg (\github{davidrosenberg}), and
Tommy (\github{tommyod})
for their invaluable comments and discussions.
\Section{Corrections and Comments}
\erratum{Page~xi}
% See https://github.com/yousuketakada/prml_errata/pull/1
Paragraph~\textminus2, Line~1:
$\left|f(x)/g(x)\right|$ should read $\left|g(x)/f(x)\right|$ (with the functions swapped).
Moreover, the limit we take is \emph{not} necessarily the one specified in the text,
i.e., $x \to \infty$,
but is often implied by the context (see below).
\parhead{Big O notation}
The big O notation~$g(x) = O\left(f(x)\right)$ generally denotes that
$\left|g(x)/f(x)\right|$ is bounded as $x \to c$
where, if $c$ is not given explicitly,
$c = 0$ for a Taylor series such as (2.299) or (D.1); or
$c = \infty$ for an asymptotic series such as (10.241) or
for computational complexity (see, e.g., Section~5.2.3),
for example.
See \citet{NIST:DLMF} for other asymptotic and order notations.
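As a simple illustration (the following examples are mine, not PRML's),
consider the two statements
\begin{align}
e^x &= 1 + x + \frac{x^2}{2} + O\left(x^3\right) , & x &\to 0 \\
\ln x &= O\left(x^{\epsilon}\right) \text{ for any fixed } \epsilon > 0 , & x &\to \infty
\end{align}
where the first is a Taylor-series statement so that the implied limit is $x \to 0$,
whereas the second is an asymptotic statement so that the implied limit is $x \to \infty$.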
\erratum{Page~5}
Equation~(1.1):
The lower ellipsis ($\ldots$) should be centered ($\cdots$).\footnote{%
The \LaTeX{} command~\texttt{\char`\\cdots} or,
with the \texttt{amsmath} or \texttt{mathtools} package,
\texttt{\char`\\dots} (in most cases) will do.}
Specifically, (1.1) should read
\begin{equation}
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j .
\end{equation}
\erratum{Page~10}
The text after (1.4):
The lower ellipsis ($\ldots$) should be centered ($\cdots$).
\erratum{Page~14}
Equation~(1.8):
Note that the \emph{conditional probability}~$p\left( Y = y_j \middle| X = x_i \right) =
n_{ij}/c_i$ is well-defined only when $p(X = x_i) = c_i/N > 0$.
\erratum{Page~18}
Equation~(1.27):
It should be noted that
the transformation~$g : \mathcal{Y} \to \mathcal{X}$ must in general be \emph{bijective} or, equivalently,
\emph{invertible} in order for the change of variables~(1.27) to be meaningful,
where $\mathcal{X}$ and $\mathcal{Y}$ are the domains of
the distributions~$p_x(\cdot)$ and $p_y(\cdot)$, respectively.\footnote{%
A function~$f : \mathcal{X} \to \mathcal{Y}$ is said to be \emph{bijective}
(or \emph{one-to-one correspondence})
if, for any $y \in \mathcal{Y}$, there exists a unique $x \in \mathcal{X}$ such that $y = f(x)$.
Note also that bijectivity is equivalent to invertibility:
A function~$f : \mathcal{X} \to \mathcal{Y}$ is bijective if and only if $f$ is \emph{invertible},
i.e., there exists an \emph{inverse function}~$f^{-1} : \mathcal{Y} \to \mathcal{X}$
(which is, of course, also bijective)
such that $f^{-1} \circ f$ is the identity function on $\mathcal{X}$
and $f \circ f^{-1}$ is the identity function on $\mathcal{Y}$
(an identity function is a function that maps every input to itself).}
This can be easily understood by noting that,
if, for any \emph{measurable}\footnote{%
A \emph{measurable} set is such that
we can consider its ``size'' (or \emph{measure}) in some sense
so that the integration over it is meaningful;
this is a concept formally defined in a branch of mathematics called \emph{measure theory}
(see, e.g., \citet{Feller:ProbabilityTheory2} in the context of probability theory
and \citet{Tao:MeasureTheory} for an introduction to \emph{Lebesgue integration}),
which however ``lies outside the scope of [PRML]'' (Page~19, Paragraph~3, Line~\textminus5).
The reason why we restrict ourselves to measurable subsets here is, of course, that
we indeed have ``pathological'' ones that are not measurable.
However, since it is safe to say that all the sets we meet in practice are measurable
(for example, measurable subsets of $\mathbb{R}$ include
all the open sets~$(a_1, b_1), (a_2, b_2), \dots$ and
their countable unions~$(a_1, b_1) \cup (a_2, b_2) \cup \dots$),
we omit the ``measurable'' qualifier for brevity in the rest of this report.}
subset~$\mathcal{X}_0 \subset \mathcal{X}$ of $\mathcal{X}$,
the \emph{preimage}\footnote{%
Let $f : \mathcal{X} \to \mathcal{Y}$ be some function
from a set~$\mathcal{X}$ (the \emph{domain})
to another set~$\mathcal{Y}$ (the \emph{codomain}).
The \emph{preimage} (or \emph{inverse image}) of
a subset~$\mathcal{Y}_0 \subset \mathcal{Y}$ of the codomain~$\mathcal{Y}$
under $f$ is defined by
\begin{equation}
f^{-1}(\mathcal{Y}_0) \equiv \{ x \in \mathcal{X} \mid f(x) \in \mathcal{Y}_0 \}
\end{equation}
so that $f^{-1}(\mathcal{Y}_0) \subset \mathcal{X}$ is a subset of the domain~$\mathcal{X}$.}~%
$\mathcal{Y}_0 = g^{-1}(\mathcal{X}_0) \subset \mathcal{Y}$
of $\mathcal{X}_0$ under the transformation~$g$ is again a measurable subset of $\mathcal{Y}$
(such a function~$g$ is said to be a \emph{measurable function}; and
every continuous function is, in fact, measurable),
then we can make the change of variables~$x = g(y)$ as
\begin{align}
\int_{\mathcal{X}_0} p_x(x) \, \mathrm{d}x &=
\int_{\mathcal{Y}_0} p_x(x)
\left| \frac{\mathrm{d}x}{\mathrm{d}y} \right| \mathrm{d}y
\\
&=
\int_{\mathcal{Y}_0} p_x(g(y))
\left| g'(y) \right| \mathrm{d}y
\end{align}
and identify the integrand of the right hand side as $p_y(y)$.
It is interesting to note here that we can represent the change of variables
in terms of expectation~\citep{Watanabe:BayesStatistics} so that
\begin{equation}
p_y(y) = \mathbb{E}_{x}\left[ \delta\left( y - g^{-1}(x) \right) \right]
\end{equation}
where $\delta(\cdot)$ is the \emph{Dirac delta function}~\citep{NIST:DLMF} and
$y = g^{-1}(x)$ is the inverse function of $x = g(y)$.\footnote{%
In fact, we have
\begin{align}
\mathbb{E}_{x}\left[ \delta\left( y_0 - g^{-1}(x) \right) \right]
&= \int \delta\left( y_0 - g^{-1}(x) \right) p_x(x) \, \mathrm{d}x \\
&= \int \delta\left( y_0 - y \right) p_x(g(y)) \left| g'(y) \right| \mathrm{d}y \\
&= p_x(g(y_0)) \left| g'(y_0) \right|
\end{align}
where we have made the change of variables~$x = g(y)$.}
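As a concrete example of the change of variables (my own, not from PRML),
let $x$ be uniformly distributed over $\mathcal{X} = (0, 1)$ so that $p_x(x) = 1$, and
consider the bijective transformation~$x = g(y) = e^{-y}$
from $\mathcal{Y} = (0, \infty)$ to $\mathcal{X} = (0, 1)$.
Since $g'(y) = -e^{-y}$, the change of variables~(1.27) gives
\begin{equation}
p_y(y) = p_x(g(y)) \left| g'(y) \right| = e^{-y} , \quad y \in (0, \infty)
\end{equation}
so that $y = -\ln x$ follows the exponential distribution,
which is indeed correctly normalized because $\int_0^\infty e^{-y} \, \mathrm{d}y = 1$.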
\parhead{Multivariate change of variables}
Similarly to the univariate case,
the multivariate version of the change of variables formula is given by
\begin{equation}
p(\mathbf{y}) =
p(\mathbf{x})
\left|
\operatorname{det}\left(\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\right)
\right|
\label{eq:multivariate_change_of_variables}
\end{equation}
where we have again assumed that
the transformation between $\mathbf{x}$ and $\mathbf{y}$ is bijective as well as differentiable;
% See https://github.com/yousuketakada/prml_errata/pull/13
and $\partial\mathbf{x}/\partial\mathbf{y} = (\partial x_i / \partial y_j )$ is
the \emph{Jacobian matrix}~(C.18).
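For example (again my own illustration),
let $\mathbf{x} = (x_1, x_2)^{\mathrm{T}}$ be governed by the standard bivariate Gaussian
\begin{equation}
p(\mathbf{x}) = \frac{1}{2\pi} \exp\left\{ -\frac{x_1^2 + x_2^2}{2} \right\}
\end{equation}
and let $\mathbf{y} = (r, \theta)^{\mathrm{T}}$ be the polar coordinates such that
$x_1 = r\cos\theta$ and $x_2 = r\sin\theta$.
The Jacobian matrix is given by
\begin{equation}
\frac{\partial\mathbf{x}}{\partial\mathbf{y}} =
\begin{pmatrix}
\cos\theta & -r\sin\theta \\
\sin\theta & r\cos\theta
\end{pmatrix}
\end{equation}
whose determinant is $r$, so that \eqref{eq:multivariate_change_of_variables} gives
\begin{equation}
p(\mathbf{y}) = \frac{r}{2\pi} \exp\left( -\frac{r^2}{2} \right) ,
\quad r \in (0, \infty) , \quad \theta \in (0, 2\pi)
\end{equation}
which is essentially the polar-coordinate transformation used in Exercise~1.7
to show that the Gaussian is correctly normalized.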
\erratum{Page~19}
Equation~(1.32):
Similarly to the discrete case~(1.8), the conditional probability density
\begin{equation}
p(y|x) = \frac{p(x, y)}{p(x)}
\label{eq:conditional_density}
\end{equation}
is well-defined only for $x$ such that $p(x) > 0$.
Note however that we can assume that the \emph{product rule}~(1.32) itself holds in general
(see below).
\parhead{Conditional as a random variable}
The conditional probability~$p(y|x)$ (or, more generally,
the conditional expectation~$\mathbb{E}_y\left[\cdot\middle|x\right]$\footnote{%
The conditional probability density~$p_{\left.y\middle|x\right.}(y|x)$ can be written as
a conditional expectation so that
\begin{equation}
p_{\left.y\middle|x\right.}(y_0|x) = \mathbb{E}_y\left[\delta(y_0 - y)\middle|x\right]
\end{equation}
where $\delta(\cdot)$ is the \emph{Dirac delta function}~\citep{NIST:DLMF}.})
can be regarded as
a random variable dependent on the random variable~$x$ on which it is conditioned.
As such, we see that
there is no particular problem if $p(y|x)$ is undefined for $x$ such that $p(x) = 0$;
for such $x$, we can define $p(y|x)$ arbitrarily as long as we have $p(y|x)p(x) = 0$,
say, $p(y|x) \equiv 0$,
although this is clearly not a valid probability density because $\int p(y|x) \, \mathrm{d}y = 0 \neq 1$.
Note also that, if $y$ is independent of $x$,
then the conditional~$p(y|x)$ is well-defined regardless of $x$ so that $p(y|x) = p(y)$ and,
again, the product rule~(1.32) holds, which, in this case, reduces to $p(x, y) = p(x) p(y)$.
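As a small concrete example (mine; given in the discrete setting for simplicity),
let $x, y \in \{0, 1\}$ be binary random variables with joint probabilities
$p(x = 0, y = 0) = p(x = 0, y = 1) = 1/2$ and $p(x = 1, y) = 0$ for both values of $y$.
Then $p(y | x = 0) = 1/2$ is well-defined,
whereas $p(y | x = 1)$ is undefined because $p(x = 1) = 0$;
nevertheless, any choice satisfying $p(y | x = 1) \, p(x = 1) = 0$, say, $p(y | x = 1) \equiv 0$,
keeps the product rule
\begin{equation}
p(x, y) = p(y | x) \, p(x)
\end{equation}
valid for every pair~$(x, y)$.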
\erratum{Page~33}
The line after (1.73):
The best-fit \emph{log} likelihood~$p\left(\mathcal{D}\middle|\mathbf{w}_{\text{ML}}\right)$
should read $\ln p\left(\mathcal{D}\middle|\mathbf{w}_{\text{ML}}\right)$.
\erratum{Page~42}
% Due to Mark-Jan Nederhof
Paragraph~1, Line~1:
``the class~$j$'' should read ``class~$\mathcal{C}_j$.''
\erratum{Page~44}
% Due to Mark-Jan Nederhof
Paragraph~2, Line~\textminus3:
Insert a space before the sentence starting ``There has been\dots''
(\emph{the third printing only}).\footnote{%
Something strange must have happened in the third (2009) printing,
leading to some spacing issues where a sentence ends with a reference number
such as ``Figure~1.27.''
We can also find other ``regression'' errors in the third printing.
Such errors are marked ``\emph{the third printing only}'' in this report.}
\erratum{Page~46}
Equation~(1.85):
A period (.) should terminate (1.85).
\erratum{Page~47}
Paragraph~1, Line~\textminus1:
$\mathbb{E}_{t}\left[\mathbf{t}\middle|\mathbf{x}\right]$
should read
$\mathbb{E}_{\mathbf{t}}\left[\mathbf{t}\middle|\mathbf{x}\right]$
(the subscript should be the vector~$\mathbf{t}$).
\erratum{Page~47}
Equation~(1.90):
The quantity~$\operatorname{var}\left[ t \middle| \mathbf{x} \right]$ is
the \emph{conditional variance},
which is defined similarly to the \emph{conditional expectation}~(1.37) so that
\begin{equation}
\operatorname{var}\left[ t \middle| \mathbf{x} \right] =
\mathbb{E}\left[
\left( t - \mathbb{E}\left[t\middle|\mathbf{x}\right] \right)^2 \middle| \mathbf{x}
\right]
\end{equation}
where we have omitted the subscript~$t$ in what should be
$\mathbb{E}_{t}\left[\cdot\middle|\mathbf{x}\right]$.
\erratum{Page~51}
Equation~(1.98):
Following the notation~(1.93) for the entropy,
we should write the left hand side of (1.98) as
$\operatorname{H}[X]$ instead of $\operatorname{H}[p]$ so that
\begin{equation}
\operatorname{H}[X] = - \sum_{i} p(x_{i}) \ln p(x_{i}) .
\label{eq:entropy_in_nats}
\end{equation}
As suggested in Appendix~D,
if we regard the (differential) entropy~$\operatorname{H}[\cdot]$ as a \emph{functional},
then we see that
``the entropy could equally well have been written as $\operatorname{H}[p]$''
(Page~703, Paragraph~1, Lines~\textminus2 and \textminus1).
However, it is probably better to maintain the notational consistency here.
\erratum{Pages~55 and 56}
The text around (1.114):
There are some inaccuracies in the definitions and the properties of
\emph{convex} and \emph{strictly convex} functions.
First, a convex function is not necessarily differentiable
(consider, e.g., the absolute value function~$f(x) = |x|$, which is convex
but not differentiable at $x = 0$).
Second, even for twice differentiable functions,
strict positivity of the second derivative is not necessary
for convexity nor for strict convexity.
Third, the condition for strict convexity that
``the equality [in (1.114)] is satisfied only for $\lambda = 0$ and $\lambda = 1$''
(Page~56, Paragraph~1, Line~\textminus4)
is meaningless because the equality holds for any $\lambda$ when $a = b$.
In the following,
instead of correcting these errors one by one,
I would like to present slightly more general definitions for
convex and strictly convex functions
where we let the parameter~$\lambda$ vary only over the open set~$(0, 1)$,
rather than over the closed set~$[0, 1]$ as in PRML,
in order to avoid edge cases.
I also give some well-known properties regarding
convex and strictly convex functions that are twice differentiable
(which I think are intended to be addressed in PRML).
\parhead{Convexity and strict convexity}
Let $f : \mathcal{X} \to \mathbb{R}$ be a real-valued, continuous\footnote{%
Strictly speaking, we do not need to assume continuity here
because convexity implies continuity.
More specifically, a convex function~$f : \mathcal{X} \to \mathbb{R}$ is continuous
on the entire domain~$\mathcal{X}$
(except the boundary~$\partial\mathcal{X}$).
To see this, consider three points~$A, B, C$ on the convex curve~$y = f(x)$
whose $x$ values are $x_0, x_1, x_2 \in \mathcal{X}$, respectively,
where we assume that $x_0 < x_1 < x_2$.
First, we see from convexity that $B$ lies below the line~$AC$.
Again from convexity, we have
(i)~that, on the interval~$[x_0, x_1]$, the curve must be below $AB$ and above $BC$; and
(ii)~that, on the interval~$[x_1, x_2]$, the curve must be below $BC$ and above $AB$.
These requirements~(i) and (ii) imply continuity of $f(x)$ at $x = x_1$.
Since $x_1$ can be any element in $\mathcal{X}$ (except $\partial\mathcal{X}$),
we see that $f(x)$ is continuous (on $\mathcal{X} \setminus \partial\mathcal{X}$).}
function defined on some \emph{convex set}~$\mathcal{X}$ such that,
if $x_0, x_1 \in \mathcal{X}$,
then $(1 - \lambda) x_0 + \lambda x_1 \in \mathcal{X}$
for any $\lambda \in (0, 1)$.
The function~$f(x)$ is said to be \emph{convex} if
\begin{equation}
f((1 - \lambda) x_0 + \lambda x_1)
\leqslant
(1 - \lambda) f(x_0) + \lambda f(x_1)
\label{eq:inequality_for_convexity}
\end{equation}
for any $x_0, x_1 \in \mathcal{X}$ and $\lambda \in (0, 1)$.
If the equality in \eqref{eq:inequality_for_convexity} holds only when $x_0 = x_1$,
then $f(x)$ is said to be \emph{strictly convex}.
If $f(x)$ is twice differentiable, the following properties hold:
\begin{itemize}
\item[(i)]
The function~$f(x)$ is convex if and only if the second derivative~$f''(x)$ is nonnegative
for all $x \in \mathcal{X}$.
\item[(ii)]
If $f''(x)$ is strictly positive for all $x \in \mathcal{X}$, then $f(x)$ is strictly convex.
Note however that the converse of this does not hold (consider, e.g., $f(x) = x^4$,
which is strictly convex but $f''(0) = 0$).
\end{itemize}
It is easy to see that convexity implies $f''(x) \geqslant 0$.
In fact, from Taylor's theorem,\footnote{%
\emph{Taylor's theorem}~\citep{AS:Handbook} states that
an $n$ times differentiable function~$f(x)$
can be expanded around a given point~$x_0$ in the form
\begin{equation}
f(x) = f(x_0) + (x - x_0) f'(x_0) +\frac{(x - x_0)^2}{2} f''(x_0) + \dots +
\frac{(x - x_0)^{n - 1}}{(n - 1)!} f^{(n - 1)}(x_0) + R_n(x)
\end{equation}
where the remainder term~$R_n(x)$ (known as the \emph{Lagrange remainder}) satisfies
\begin{equation}
R_n(x) = \frac{(x - x_0)^{n}}{n!} f^{(n)}(\xi)
\end{equation}
for some $\xi$ between $x_0$ and $x$.
If we expand $f(x)$ around $x$ with displacement~$\pm h$,
then the Taylor series expansion (up to the third order term) is given by
\begin{equation}
f(x \pm h) =
f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{3!} f^{(3)}(x) + O\left(h^4\right) .
\end{equation}
}
we can write $f''(x)$ in the form
\begin{equation}
f''(x) = \lim_{h \to 0} \frac{f(x - h) - 2 f(x) + f(x + h)}{h^2}
\end{equation}
where we see that the right hand side is nonnegative
from the inequality condition~\eqref{eq:inequality_for_convexity}
in which we let $\lambda = 1/2$ and $x_i = x + (2i - 1) h$ where $i = 0, 1$.
To show the converse,
we again make use of Taylor's theorem and
expand $f(x)$ around $x_{\lambda} = (1 - \lambda) x_0 + \lambda x_1$
so that we have
\begin{equation}
f(x_i) = f(x_{\lambda}) + (x_i - x_{\lambda}) f' (x_{\lambda})
+ \frac{(x_i - x_{\lambda})^2}{2} f''(\xi_i)
\end{equation}
for some $\xi_i$ between $x_{\lambda}$ and $x_i$.
With this expansion,
we can write the right hand side of \eqref{eq:inequality_for_convexity} in the form
\begin{equation}
(1 - \lambda) f(x_0) + \lambda f(x_1)
= f(x_{\lambda})
+ \lambda (1 - \lambda) \frac{(x_1 - x_0)^2}{2}
\left\{
\lambda f''(\xi_0) + (1 - \lambda) f''(\xi_1)
\right\}
\end{equation}
from which we see that $f''(x) \geqslant 0$ implies convexity
and also that $f''(x) > 0$ implies strict convexity.
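As a quick application of these properties (my own example),
consider the negative logarithm~$f(x) = -\ln x$ defined over $\mathcal{X} = (0, \infty)$,
which is the convex function used in the derivation~(1.118).
Since
\begin{equation}
f''(x) = \frac{1}{x^2} > 0 , \quad x \in (0, \infty)
\end{equation}
the property~(ii) shows that $f(x) = -\ln x$ is (strictly) convex
without our having to verify the inequality~\eqref{eq:inequality_for_convexity} directly.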
\erratum{Page~56}
Equation~(1.116):
In general, we cannot interpret $\lambda_i$ in \emph{Jensen's inequality}~(1.115) as
the probability distribution over a discrete random variable~$x$
such that $\lambda_i \equiv p(x = x_i)$
because, since (1.115) holds for any $\left\{x_i\right\}$,
we can take, say, $x_i = x_j$ and $\lambda_i \neq \lambda_j$ where $i \neq j$,
assigning different probabilities to the same value of $x$.
Actually, (1.116) is a special case of (1.115).
An equivalent of (1.115) in terms of random variables can be derived as follows.
\parhead{Jensen's inequality in terms of random variables}
In order to interpret (1.115) probabilistically,
we instead introduce another set of underlying random variables~$\mathbf{z}$
such that $\lambda_i \equiv p(\mathbf{z} = \mathbf{z}_i)$ and
a function~$\xi(\cdot)$ such that $x_i \equiv \xi(\mathbf{z}_i)$,
giving a result slightly more general than (1.116)
\begin{equation}
f\left(\mathbb{E}_{\mathbf{z}}\left[\xi(\mathbf{z})\right]\right)
\leqslant \mathbb{E}_{\mathbf{z}}\left[f\left(\xi(\mathbf{z})\right)\right]
\label{eq:Jensen_inequality_expectation_with_z}
\end{equation}
where $f(\cdot)$ is a convex function but $\xi(\cdot)$ can be arbitrary.
Moreover, if $f(\cdot)$ is strictly convex,
the equality in \eqref{eq:Jensen_inequality_expectation_with_z} holds if and only if
$\xi(\mathbf{z})$ is constant with probability one or \emph{almost surely},\footnote{%
Here, the proviso~\emph{almost surely} (often abbreviated as a.s.) or
\emph{almost everywhere} (a.e.) means that
there may be some exceptions but they can occur only with probability zero (or \emph{measure} zero)
so that we can safely ignore them;
this is a concept formally defined in a branch of mathematics called
\emph{measure theory}~\citep{Feller:ProbabilityTheory2,Tao:MeasureTheory}.
As in PRML, we omit such ``almost'' provisos for brevity in the rest of this report.}
meaning that there exists some constant~$\xi_0$ such that
$\xi(\mathbf{z}) = \xi_0$ on the range of $\mathbf{z}$ \emph{almost everywhere},
in which case we have $\mathbb{E}_{\mathbf{z}}\left[\xi(\mathbf{z})\right] = \xi_0$
and the both sides of \eqref{eq:Jensen_inequality_expectation_with_z} equal $f(\xi_0)$.
See Figure~\ref{fig:Jensen_inequality} for an intuitive, physical ``proof'' of
the inequality~\eqref{eq:Jensen_inequality_expectation_with_z}.
\begin{figure}
\centering
\includegraphics{jensen_inequality.eps}
\caption{A physical ``proof'' of
Jensen's inequality~\citep{MacKay:Information}.
Let us suppose that we have a set of point masses~$m_i = p(\mathbf{z}=\mathbf{z}_i)$,
denoted by filled blue circles ({\large\textcolor{blue}{$\bullet$}})
with areas proportional to $m_i$,
and place them at the corresponding
locations~$(x, y) = \left(\xi(\mathbf{z}_i), f(\xi(\mathbf{z}_i))\right)$
on a convex curve~$y = f(x)$.
The center of gravity of those masses, which is $\left(
\mathbb{E}_{\mathbf{z}}\left[\xi(\mathbf{z})\right],
\mathbb{E}_{\mathbf{z}}\left[f\left(\xi(\mathbf{z})\right)\right]
\right)$, denoted by a cross sign ($\boldsymbol{\times}$),
must lie above the convex curve and thus right above the point~$\left(
\mathbb{E}_{\mathbf{z}}\left[\xi(\mathbf{z})\right],
f\left(\mathbb{E}_{\mathbf{z}}\left[\xi(\mathbf{z})\right]\right)
\right)$ on the curve, denoted by a filled square ($\blacksquare$),
showing Jensen's inequality~\eqref{eq:Jensen_inequality_expectation_with_z}.
One can also see that, if $f(\cdot)$ is strictly convex,
the equality in \eqref{eq:Jensen_inequality_expectation_with_z}
implies that $\xi(\mathbf{z})$ is almost surely constant
(it is trivial to show that the converse is true).}
\label{fig:Jensen_inequality}
\end{figure}
Since the random variables~$\mathbf{z}$ as well as their probability~$p(\mathbf{z})$ can be chosen
arbitrarily,
it makes sense to leave $\mathbf{z}$ implicit in \eqref{eq:Jensen_inequality_expectation_with_z},
giving a simpler form of Jensen's inequality
\begin{equation}
f\left(\mathbb{E}\left[\xi\right]\right)
\leqslant \mathbb{E}\left[f\left(\xi\right)\right] .
\end{equation}
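As an elementary sanity check (my own example),
let us take $f(\xi) = \xi^2$, which is strictly convex because $f''(\xi) = 2 > 0$;
Jensen's inequality then reduces to
\begin{equation}
\left( \mathbb{E}\left[\xi\right] \right)^2 \leqslant \mathbb{E}\left[\xi^2\right]
\end{equation}
which is just the familiar statement that the variance~$\operatorname{var}[\xi] =
\mathbb{E}\left[\xi^2\right] - \left( \mathbb{E}\left[\xi\right] \right)^2$ is nonnegative,
with equality if and only if $\xi$ is (almost surely) constant.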
For continuous random variables, we have
\begin{equation}
f\left(\int\xi(\mathbf{x}) p(\mathbf{x}) \,\mathrm{d}\mathbf{x}\right) \leqslant
\int f\left(\xi(\mathbf{x})\right) p(\mathbf{x}) \,\mathrm{d}\mathbf{x}
\label{eq:Jensen_inequality_continuous}
\end{equation}
where we have used $\mathbf{x}$ to denote the underlying random variables
for which we take the expectations.
By making use of \eqref{eq:Jensen_inequality_continuous},
one can show that
the \emph{Kullback-Leibler divergence}~$\operatorname{KL}\left( p \middle\| q \right)$
given by (1.113) satisfies \emph{Gibbs's inequality}
\begin{equation}
\operatorname{KL}\left( p \middle\| q \right) \geqslant 0
\label{eq:Gibbs_inequality}
\end{equation}
with equality if and only if $p(\mathbf{x}) = q(\mathbf{x})$ almost everywhere.
See the following erratum for more details.
\erratum{Page~56}
Equation~(1.118):
There are some difficulties in the derivation~(1.118) of
Gibbs's inequality~\eqref{eq:Gibbs_inequality}.
First, the quantity~$\xi(\mathbf{x}) = q(\mathbf{x})/p(\mathbf{x})$ is undefined for $\mathbf{x}$
such that $p(\mathbf{x}) = 0$.
Second, the convex function~$f(\xi) = -\ln\xi$ is undefined for $\xi = 0$,
which occurs where $q(\mathbf{x}) = 0$ and $p(\mathbf{x}) > 0$.
In order to avoid these (potential) difficulties,
we shall take a different approach~\citep{MacKay:Information,KullbackLeibler:Information}
in which we make use of Jensen's inequality~\eqref{eq:Jensen_inequality_continuous}
with respect to $q(\mathbf{x})$ where
we identify $f(\xi) = \xi\ln\xi$ and
$\xi(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$.
Note that we can safely proceed with this approach
because we can assume $q(\mathbf{x}) > 0$ without loss of generality (see below).
In the following, we first show Gibbs's inequality~\eqref{eq:Gibbs_inequality} along this line,
after which we also see that
the Kullback-Leibler divergence~$\operatorname{KL}\left(p\middle\|q\right)$ is convex
in the sense that it satisfies the inequality~\eqref{eq:inequality_for_convexity}
with respect to the pair of the distributions~$(p, q)$.
Finally, we give an alternative proof of Gibbs's inequality~\eqref{eq:Gibbs_inequality}
in terms of a generalized version of the Kullback-Leibler divergence such that
it is extended for unnormalized distributions.
\parhead{A proof of Gibbs's inequality in terms of Jensen's inequality}
Let us first examine the behavior of the integrand of
the Kullback-Leibler divergence~$\operatorname{KL}\left( p \middle\| q \right)$
\begin{equation}
p(\mathbf{x}) \ln p(\mathbf{x}) - p(\mathbf{x}) \ln q(\mathbf{x})
\label{eq:integrand_of_KL}
\end{equation}
where $q(\mathbf{x})$ or $p(\mathbf{x})$ vanishes.
We notice that, if $q(\mathbf{x}) \to 0$ for $\mathbf{x}$ such that $p(\mathbf{x}) > 0$,
the integrand~\eqref{eq:integrand_of_KL} diverges so that
$\operatorname{KL}\left( p \middle\| q \right) \to \infty$.
On the other hand, the integrand~\eqref{eq:integrand_of_KL} always vanishes for $\mathbf{x}$
such that $p(\mathbf{x}) = 0$ regardless of the values of $q(\mathbf{x})$.\footnote{%
Recall that we have defined $0 \log_2 0 \equiv 0$ or, equivalently, $0 \ln 0 \equiv 0$
(Page~49, Paragraph~2, Line~\textminus2) so that
the entropy in ``bits''~(1.93) or ``nats''~\eqref{eq:entropy_in_nats} is well-defined.}
Therefore, in order for $\operatorname{KL}\left( p \middle\| q \right)$ to be well-defined,
we must have $p(\mathbf{x}) = 0$ for all $\mathbf{x}$ such that $q(\mathbf{x}) = 0$ or,
stated differently, the \emph{support} of $p(\mathbf{x})$ must be contained in
that of $q(\mathbf{x})$, i.e.,\footnote{%
One can understand \eqref{eq:supp_p_contained_in_supp_q} intuitively
from the perspective of information theory as follows.
As we have seen in Section~1.6.1,
the Kullback-Leibler divergence~$\operatorname{KL}\left( p \middle\| q \right)$
can be interpreted as
the average amount of information (in nats)
wasted to encode samples generated from the source~$p$
with an encoder optimized for $q$.
In order for this relative entropy~$\operatorname{KL}\left( p \middle\| q \right)$
to be well-defined (i.e., in order that we can encode every sample),
the support of the source~$p$ must be contained in that of the encoder~$q$.
Note also that, in the context of variational inference
in which we minimize $\operatorname{KL}\left( p \middle\| q \right)$
by optimizing $p$ given $q$ (\emph{variational Bayes}) or
$q$ given $p$ (\emph{expectation propagation}),
the property~\eqref{eq:supp_p_contained_in_supp_q} is referred to as
\emph{zero-forcing} or \emph{zero-avoiding}
because $q = 0$ implies $p = 0$ or, equivalently, $p > 0$ implies $q > 0$
(see Section~10.1.2).}
\begin{equation}
\operatorname{supp}(p) \subset \operatorname{supp}(q)
\label{eq:supp_p_contained_in_supp_q}
\end{equation}
where $\operatorname{supp}(p) = \{ \mathbf{x} \mid p(\mathbf{x}) > 0 \}$ and so on.\footnote{%
Here, we define the \emph{support}~$\operatorname{supp}(f)$ of
a real-valued function~$f : \mathcal{X} \to \mathbb{R}$ by
\begin{equation}
\operatorname{supp}(f) \equiv \{ \mathbf{x} \in \mathcal{X} \mid f(\mathbf{x}) \neq 0 \}
\end{equation}
so that, for a probability density function~$p : \mathcal{X} \to [0, \infty)$, we have
$\operatorname{supp}(p) = \{ \mathbf{x} \in \mathcal{X} \mid p(\mathbf{x}) > 0 \}$.}
Note that, for two sets~$A$ and $B$,
we write $A \subset B$ or $B \supset A$ if $a \in B$ for all $a \in A$
so that $\operatorname{supp}(p)$ may equal $\operatorname{supp}(q)$ in
\eqref{eq:supp_p_contained_in_supp_q}.\footnote{%
In fact, a set~$A$ equals another set~$B$ or $A = B$
if and only if $A \subset B$ and $A \supset B$.}
Assuming the condition~\eqref{eq:supp_p_contained_in_supp_q} under which
the Kullback-Leibler divergence~$\operatorname{KL}\left( p \middle\| q \right)$
is well-defined,
we can restrict the integration in $\operatorname{KL}\left( p \middle\| q \right)$
only over the support~$\Omega \equiv \operatorname{supp}(q)$ of $q(\mathbf{x})$.
Identifying $f(\xi) = \xi\ln\xi$ (see Figure~\ref{fig:x_ln_x}) and
$\xi(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$,
we have
\begin{align}
\operatorname{KL}\left( p \middle\| q \right)
&= \int_{\Omega} q(\mathbf{x})
\frac{p(\mathbf{x})}{q(\mathbf{x})}
\ln \left\{\frac{p(\mathbf{x})}{q(\mathbf{x})}\right\}
\mathrm{d}\mathbf{x} \\
&= \int_{\Omega} q(\mathbf{x})
f\left(\xi(\mathbf{x})\right)
\mathrm{d}\mathbf{x} \\
&\geqslant f\left(
\int_{\Omega} q(\mathbf{x}) \xi(\mathbf{x}) \, \mathrm{d}\mathbf{x}
\right) \\
&= f\left(\int_{\Omega} p(\mathbf{x}) \, \mathrm{d}\mathbf{x} \right) \\
&= f(1) = 0
\end{align}
where we have used Jensen's inequality~\eqref{eq:Jensen_inequality_continuous}
with respect to $q(\mathbf{x})$ (instead of $p(\mathbf{x})$).
Note that, since $q(\mathbf{x}) > 0$ for all $\mathbf{x} \in \Omega$, we see that
$\xi(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x}) \geqslant 0$ is well-defined
for all $\mathbf{x} \in \Omega$ and
so is $f\left(\xi(\mathbf{x})\right) = \xi(\mathbf{x})\ln\xi(\mathbf{x})$.
Since $f(\xi)$ is strictly convex,
the equality~$\operatorname{KL}\left( p \middle\| q \right) = 0$ holds if and only if
$\xi(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$ is constant for all $\mathbf{x} \in \Omega$,
which, together with \eqref{eq:supp_p_contained_in_supp_q}, yields the equality condition that
$p(\mathbf{x}) = q(\mathbf{x})$ for all $\mathbf{x}$.
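As a concrete check of Gibbs's inequality (my own example),
consider two univariate Gaussians with a common variance,
$p(x) = \mathcal{N}\left( x \middle| \mu_1, \sigma^2 \right)$ and
$q(x) = \mathcal{N}\left( x \middle| \mu_2, \sigma^2 \right)$,
for which the condition~\eqref{eq:supp_p_contained_in_supp_q} holds trivially.
A direct calculation gives
\begin{equation}
\operatorname{KL}\left( p \middle\| q \right)
= \int p(x) \, \frac{ (x - \mu_2)^2 - (x - \mu_1)^2 }{ 2\sigma^2 } \, \mathrm{d}x
= \frac{ (\mu_1 - \mu_2)^2 }{ 2\sigma^2 }
\geqslant 0
\end{equation}
with equality if and only if $\mu_1 = \mu_2$, i.e., $p = q$,
in agreement with \eqref{eq:Gibbs_inequality}.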
\begin{figure}
\centering
\input{fig/x_ln_x_plot.tex}
\caption{Plot of $f(x) = x \ln x$.
The function~$f(x)$ is a strictly convex function defined over $\left[0, \infty\right)$
where we have defined $f(0) = 0 \ln 0 \equiv 0$.
The curve~$y = f(x)$ takes the minimum at $(x, y) = \left(e^{-1}, -e^{-1}\right)$.
The roots (the values of $x$ such that $f(x) = 0$) are $x = 0$ and $x = 1$.}
\label{fig:x_ln_x}
\end{figure}
\parhead{Convexity of Kullback-Leibler divergence}
Let us next consider the weighted sum of two well-defined Kullback-Leibler divergences~%
$\operatorname{KL}\left(p_1\middle\|q_1\right), \operatorname{KL}\left(p_2\middle\|q_2\right)
< \infty$
\begin{align}
S &\equiv
\lambda \operatorname{KL}\left(p_1\middle\|q_1\right)
+ (1 - \lambda) \operatorname{KL}\left(p_2\middle\|q_2\right) \\
&=
\lambda
\int_{\Omega_1} p_1(\mathbf{x}) \ln\left\{\frac{p_1(\mathbf{x})}{q_1(\mathbf{x})}\right\}
\mathrm{d}\mathbf{x}
+ (1 - \lambda)
\int_{\Omega_2} p_2(\mathbf{x}) \ln\left\{\frac{p_2(\mathbf{x})}{q_2(\mathbf{x})}\right\}
\mathrm{d}\mathbf{x}
\end{align}
where $\lambda \in (0, 1)$.
Here, we again assume \eqref{eq:supp_p_contained_in_supp_q} for each pair of
the distributions~$(p_i, q_i)$, i.e.,
$p_i(\mathbf{x}) = 0$ for all $\mathbf{x} \notin \Omega_i \equiv \operatorname{supp}(q_i)$
where $i = 1, 2$.
Writing
\begin{align}
\begin{aligned}
a_1 &\equiv \lambda p_1(\mathbf{x}), &
a_2 &\equiv (1 - \lambda) p_2(\mathbf{x}) \\
b_1 &\equiv \lambda q_1(\mathbf{x}), &
b_2 &\equiv (1 - \lambda) q_2(\mathbf{x})
\end{aligned}
\end{align}
and noting that $a_i = b_i = 0$ outside $\Omega_i$ and $b_i > 0$ otherwise, we have
\begin{align}
S &=
\int_{\Omega_1} a_1 \ln \frac{a_1}{b_1} \,\mathrm{d}\mathbf{x} +
\int_{\Omega_2} a_2 \ln \frac{a_2}{b_2} \,\mathrm{d}\mathbf{x} \\
&=
\int_{\Omega_1\setminus\Omega_2} a_1 \ln\frac{a_1}{b_1} \,\mathrm{d}\mathbf{x} +
\int_{\Omega_2\setminus\Omega_1} a_2 \ln\frac{a_2}{b_2} \,\mathrm{d}\mathbf{x} +
\int_{\Omega_1 \cap \Omega_2}
\left[
a_1 \ln\frac{a_1}{b_1} +
a_2 \ln\frac{a_2}{b_2}
\right]
\mathrm{d}\mathbf{x}
\\
&\geqslant
\int_{\Omega_1 \cup \Omega_2}
\left(a_1 + a_2\right) \ln\frac{a_1 + a_2}{b_1 + b_2}
\,\mathrm{d}\mathbf{x}
\\
&=
\operatorname{KL}\left(
\lambda p_1 + (1 - \lambda) p_2
\middle\|
\lambda q_1 + (1 - \lambda) q_2
\right)
\end{align}
from which we see that
$\operatorname{KL}\left(p\middle\|q\right)$ is convex with respect to $(p, q)$.
Here, we have used the inequality\footnote{%
The inequality used here is a special case of the \emph{log sum inequality},
which states that, for nonnegative $a_i, b_i$,
\begin{equation}
\sum_i a_i \ln \frac{a_i}{b_i} \geqslant a \ln \frac{a}{b}
\end{equation}
with equality if and only if there exists some constant~$c$ such that $a_i = c b_i$ for all $i$
where $a = \sum_i a_i$ and $b = \sum_i b_i$.}
\begin{align}
a_1 \ln\frac{a_1}{b_1} +
a_2 \ln\frac{a_2}{b_2}
&=
\left( b_1 + b_2 \right)
\left[
\frac{b_1}{ b_1 + b_2 } f\left(\frac{a_1}{b_1}\right) +
\frac{b_2}{ b_1 + b_2 } f\left(\frac{a_2}{b_2}\right)
\right]
\\
&\geqslant
\left( b_1 + b_2 \right)
f\left(\frac{ a_1 + a_2 }{ b_1 + b_2 }\right)
\\
&=
\left( a_1 + a_2 \right)
\ln \frac{ a_1 + a_2 }{ b_1 + b_2 }
\end{align}
where we have again used Jensen's inequality~\eqref{eq:Jensen_inequality_expectation_with_z}
with $f(\xi) = \xi \ln \xi$.
We also see that
(i)~convexity of $\operatorname{KL}\left(p\middle\|q\right)$ with respect to $(p, q)$ implies
(ii)~convexity of $\operatorname{KL}\left(p\middle\|q\right)$ with respect to $p$.
Although the former convexity~(i) is not strict in general
(consider, e.g., the case where $\Omega_1 \cap \Omega_2 = \emptyset$),
the latter convexity~(ii) can be shown to be strict with a similar discussion as above
(where we let $q_1 = q_2$ and thus $\Omega_1 = \Omega_2$), i.e.,
$\operatorname{KL}\left(p\middle\|q\right)$ is strictly convex with respect to $p$
so that
\begin{equation}
\lambda \operatorname{KL}\left( p_1 \middle\| q \right) +
(1 - \lambda) \operatorname{KL}\left( p_2 \middle\| q \right)
\geqslant \operatorname{KL}\left( \lambda p_1 + (1 - \lambda) p_2 \middle\|q\right) ,
\quad \lambda \in (0, 1)
\label{eq:strict_convexity_of_KL_wrt_p}
\end{equation}
with equality if and only if $p_1 = p_2$.
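To spell out the disjoint-support example mentioned above
(showing that the convexity~(i) with respect to $(p, q)$ is not strict),
suppose that $\Omega_1 \cap \Omega_2 = \emptyset$.
Since $p_i$ and $q_i$ vanish outside $\Omega_i$, we have
\begin{align}
\operatorname{KL}\left(
\lambda p_1 + (1 - \lambda) p_2
\middle\|
\lambda q_1 + (1 - \lambda) q_2
\right)
&=
\int_{\Omega_1} \lambda p_1 \ln\frac{\lambda p_1}{\lambda q_1} \,\mathrm{d}\mathbf{x} +
\int_{\Omega_2} (1 - \lambda) p_2 \ln\frac{(1 - \lambda) p_2}{(1 - \lambda) q_2} \,\mathrm{d}\mathbf{x}
\\
&=
\lambda \operatorname{KL}\left( p_1 \middle\| q_1 \right)
+ (1 - \lambda) \operatorname{KL}\left( p_2 \middle\| q_2 \right)
\end{align}
so that the equality holds for every $\lambda \in (0, 1)$
even though $(p_1, q_1) \neq (p_2, q_2)$ in general.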
\parhead{Extended Kullback-Leibler divergence}
Let us now define what we call the \emph{extended Kullback-Leibler divergence}~%
\citep{Minka:DivergenceMeasures,Zhu:InformationGeometric} as
\begin{equation}
\operatorname{EKL}\left(\widetilde{p}^{\,}\middle\|\widetilde{q}^{\,}\right)
\equiv
\int
\widetilde{p}(\mathbf{x})
\ln\left\{ \frac{\widetilde{p}(\mathbf{x})}{\widetilde{q}(\mathbf{x})} \right\}
\mathrm{d}\mathbf{x}
+
\int
\left\{
\widetilde{q}(\mathbf{x}) -
\widetilde{p}(\mathbf{x})
\right\}
\mathrm{d}\mathbf{x} .
\label{eq:extended_Kullback_Leibler_divergence}
\end{equation}
Note that
the definition~\eqref{eq:extended_Kullback_Leibler_divergence} of
the extended Kullback-Leibler divergence
includes a correction term of the form~$
\int
\left\{
\widetilde{q}(\mathbf{x}) -
\widetilde{p}(\mathbf{x})
\right\}
\mathrm{d}\mathbf{x}
$
so that it applies to
unnormalized distributions~$\widetilde{p}(\mathbf{x})$ and $\widetilde{q}(\mathbf{x})$.
One can easily see that,
for correctly normalized distributions~$p(\mathbf{x})$ and $q(\mathbf{x})$,
the correction term vanishes so that
\begin{equation}
\operatorname{KL} \left(p\middle\|q\right) =
\operatorname{EKL}\left(p\middle\|q\right) .
\label{eq:KL_as_EKL}
\end{equation}
The extended Kullback-Leibler divergence~\eqref{eq:extended_Kullback_Leibler_divergence}
can also be written in the form
\begin{equation}
\operatorname{EKL}\left(\widetilde{p}^{\,}\middle\|\widetilde{q}^{\,}\right)
=
\int \widetilde{p}(\mathbf{x}) \,
F\left(
\ln\left\{ \frac{\widetilde{p}(\mathbf{x})}{\widetilde{q}(\mathbf{x})} \right\}
\right)
\mathrm{d}\mathbf{x}
\end{equation}
where
\begin{equation}
F(t) = t + e^{-t} - 1, \quad t \in (-\infty, \infty) .
\end{equation}
Since $F(t) \geqslant 0$ with equality if and only if $t = 0$,\footnote{%
Note that $F(t)$ is a strictly convex function (because $F''(t) > 0$) and
takes the minimum~$F(t) = 0$ at $t = 0$.}
we see that an analogue of Gibbs's inequality~\eqref{eq:Gibbs_inequality} holds also for
the extended Kullback-Leibler divergence~\eqref{eq:extended_Kullback_Leibler_divergence},
i.e., we have
\begin{equation}
\operatorname{EKL}\left(\widetilde{p}^{\,}\middle\|\widetilde{q}^{\,}\right) \geqslant 0
\label{eq:Gibbs_inequality_for_EKL}
\end{equation}
with equality if and only if $\widetilde{p}(\mathbf{x}) = \widetilde{q}(\mathbf{x})$
for all $\mathbf{x}$.
Thus, we see that Gibbs's inequality~\eqref{eq:Gibbs_inequality} for
the ordinary Kullback-Leibler divergence~(1.113) can also be shown
from Gibbs's inequality~\eqref{eq:Gibbs_inequality_for_EKL} for
the extended Kullback-Leibler divergence~\eqref{eq:extended_Kullback_Leibler_divergence}
via the relationship~\eqref{eq:KL_as_EKL}.
It is interesting to note here that,
if two (normalized) distributions~$p(\mathbf{x})$ and $q(\mathbf{x})$ are close
so that $p(\mathbf{x}) \approx q(\mathbf{x})$,
then the Kullback-Leibler divergence between $p(\mathbf{x})$ and $q(\mathbf{x})$
can be approximated~\citep{Watanabe:BayesStatistics} as
\begin{equation}
\operatorname{KL}\left(p\middle\|q\right) \approx
\frac{1}{2} \int
p(\mathbf{x})
\left\{
\ln p(\mathbf{x}) -
\ln q(\mathbf{x})
\right\}^2
\mathrm{d}\mathbf{x}
\end{equation}
where we have again used \eqref{eq:KL_as_EKL} and
the Taylor expansion~$F(t) \simeq t^2/2$ of $F(t)$ around $t = 0$.\footnote{%
Although the sign~$\simeq$ is used for any type of approximate equality in PRML,
I make some (soft) distinction between $\simeq$ and $\approx$ in this report:
I use $\simeq$ for series expansions where approximation can be exact at some point; and
$\approx$ for general (e.g., numerical) approximation including the Laplace approximation
(see Section~4.4).}
Moreover, one can easily see from the linearity of the correction term that
the extended Kullback-Leibler divergence~\eqref{eq:extended_Kullback_Leibler_divergence}
enjoys convexity similarly to the ordinary Kullback-Leibler divergence~(1.113).
Specifically, $\operatorname{EKL}\left(\widetilde{p}^{\,}\middle\|\widetilde{q}^{\,}\right)$ is
(i)~convex with respect to $(\widetilde{p}, \widetilde{q}^{\,})$ and
(ii)~strictly convex with respect to $\widetilde{p}$.
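As a simple illustration of the role of the correction term (my own example),
let us take $\widetilde{p}(\mathbf{x}) = c \, \widetilde{q}(\mathbf{x})$ for some constant~$c > 0$,
i.e., two unnormalized distributions of the same shape but different overall scales,
and write $Z = \int \widetilde{q}(\mathbf{x}) \, \mathrm{d}\mathbf{x}$.
The definition~\eqref{eq:extended_Kullback_Leibler_divergence} then gives
\begin{equation}
\operatorname{EKL}\left(\widetilde{p}^{\,}\middle\|\widetilde{q}^{\,}\right)
= c Z \ln c + Z (1 - c)
= Z \left( c \ln c - c + 1 \right)
\geqslant 0
\end{equation}
with equality if and only if $c = 1$
(the function~$g(c) = c \ln c - c + 1$ satisfies $g'(c) = \ln c$ and $g''(c) = 1/c > 0$,
so that it attains its minimum value of zero at $c = 1$).
Note that the first integral in \eqref{eq:extended_Kullback_Leibler_divergence} alone
would equal $c Z \ln c$ in this case, which is negative for $c < 1$;
it is the correction term that makes the divergence sensitive to
a mismatch in the overall scale.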
\erratum{Page~59}
% Due to Mark-Jan Nederhof
Exercise~1.7, Line~2:
``To do this consider, the\dots'' should be
``To do this, consider the\dots''
\erratum{Page~61}
% Due to Mark-Jan Nederhof
Exercise~1.15, Line~\textminus1:
Add a period (.) at the end of the last sentence.
\erratum{Page~62}
Exercise~1.18, the text after (1.142):
``Gamma'' should read ``gamma'' (without capitalization).
\erratum{Page~64}
% Due to Mark-Jan Nederhof
Exercise~1.28, Line~1:
In Section~1.6,
the quantity~$h(x)$ is introduced as a measure of the information gained on observing
the random variable~$x$,
whereas the \emph{entropy} is the average of $h(x)$ over $x$.
The first sentence should thus read, e.g.,
``In Section~1.6, we introduced the idea of entropy as
the average of the information~$h(x)$ gained\dots''
\erratum{Page~64}
% Due to Mark-Jan Nederhof
Exercise~1.28, Lines~3 and 4:
``the entropy functions are additive, so that\dots'' should read, e.g.,
``the information~$h(\cdot)$ is additive so that\dots''
(see also the previous erratum).
\erratum{Page~69}
Equation~(2.2):
Although the \emph{Bernoulli} distribution~$\operatorname{Bern}\left( x \middle| \mu \right)$
is a valid, correctly normalized probability distribution for any value of
the parameter~$0 \leqslant \mu \leqslant 1$ (Page~69, Paragraph~1, Line~1),\footnote{%
Similarly to $0 \ln 0 \equiv 0$, we have defined $0^0 \equiv 1$ here.}
it becomes \emph{degenerate}, i.e., $x$ is fixed to a single value so that $x = 0$ or $x = 1$
if $\mu = 0$ or $\mu = 1$, respectively.\footnote{%
Note however that, in the context of Bayesian inference
in which we regard the parameter~$\mu$ as a random variable,
the Bernoulli distribution~$\operatorname{Bern}\left( x \middle| \mu \right)$ cannot be degenerate
because $\mu \in (0, 1)$ almost surely
so that the edge cases, i.e., $\mu = 0$ and $\mu = 1$, do not matter after all.}
Such degenerate distributions are often difficult to treat with any generality, so that
they seem in fact to have been excluded (implicitly or explicitly)
from most of the discussions in PRML.
For example, we would be unable to take the logarithm of
the Bernoulli likelihood~(2.5) to give (2.6)
if the distribution could be degenerate
(because the logarithm diverges if $\mu = 0$ or $\mu = 1$).
More generally, one cannot identify a degenerate distribution with
any member of the \emph{exponential family} (see Section~2.4).
For instance, the degenerate Bernoulli cannot be expressed
as the exponential of the logarithm as in (2.197)
because its \emph{natural parameter}~(2.198) is again not well-defined for $\mu = 0$ or $\mu = 1$.
\parhead{Restriction on probability of success}
In order for the Bernoulli distribution~$\operatorname{Bern}\left( x \middle| \mu \right)$
\emph{not} to be degenerate,
we assume in this report that the parameter~$\mu$ (called the \emph{probability of success})
is restricted to the open set~$(0, 1)$, rather than to the closed set~$[0, 1]$ as in PRML,
so that we can write
\begin{equation}
\operatorname{Bern}\left( x \middle| \mu \right) = \mu^{x} (1 - \mu)^{1 - x}
= \exp\left\{ x \ln \mu + (1 - x) \ln (1 - \mu) \right\},
\quad \mu \in (0, 1)
\label{eq:Bernoulli_in_exponential_of_logarithm}
\end{equation}
while we can still consider the degenerate case as the limit~$\mu \to 0$ or $\mu \to 1$.
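For reference (this is a standard manipulation along the lines of Section~2.4),
with the restriction~$\mu \in (0, 1)$
the Bernoulli distribution~\eqref{eq:Bernoulli_in_exponential_of_logarithm}
can be written in the exponential-family form
\begin{equation}
\operatorname{Bern}\left( x \middle| \mu \right)
= \sigma(-\eta) \exp\left( \eta x \right) ,
\quad
\eta = \ln\left( \frac{\mu}{1 - \mu} \right) ,
\quad
\sigma(\eta) = \frac{1}{1 + \exp(-\eta)}
\end{equation}
where the natural parameter~$\eta$ (the log odds) is finite
precisely because $\mu \in (0, 1)$;
as $\mu \to 0$ or $\mu \to 1$, we have $\eta \to -\infty$ or $\eta \to +\infty$, respectively,
which is another way of seeing that
the degenerate cases fall outside the exponential family.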
Similar discussions also apply to other discrete distributions, i.e.,
the \emph{binomial} distribution~(2.9) and the \emph{multinomial} distribution~(2.34),
for which we shall, therefore, assume
the restrictions~$\mu \in (0, 1)$ and $\mu_k \in (0, 1)$ for all $k$,
respectively.
See also our definition~\eqref{eq:multinoulli_distribution} of the \emph{multinoulli} distribution.
Accordingly, the same restrictions~$\mu \in (0, 1)$ and $\mu_k \in (0, 1)$ for all $k$ are assumed
also on the domains of the (conjugate) prior distributions, i.e.,
the \emph{beta} distribution~(2.13) and the \emph{Dirichlet} distribution~(2.38).
Note that these restrictions do not affect correct normalization of the distributions;