\documentclass{report}
\usepackage{geometry}
\geometry{margin=1in}
\usepackage{authblk}
\usepackage{mathptmx}
\usepackage{url,latexsym,amsmath,amsthm,xspace,rotating,multirow,multicol,xspace,amssymb,paralist}
\usepackage{euscript}
\usepackage{fancybox,xcolor}
\usepackage{longtable}
\usepackage{paralist}
\usepackage[normalem]{ulem}
\usepackage[pdftex]{hyperref}
\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{algorithm}
\usepackage{cancel}
\usepackage{times}
\usepackage{graphicx}
\usepackage{tabularx}
%\usepackage{hyperref}
%\usepackage{natbib}
\usepackage{wrapfig}
\usepackage{comment}
\usepackage{listings}
\usepackage{color}
\usepackage[utf8]{inputenc}
\usepackage{fancyvrb}
\usepackage{booktabs}
\newcommand{\obs}{\text{obs}}
\newcommand{\mis}{\text{mis}}
\newcommand{\qt}[1]{\left<#1\right>}
\newcommand{\ql}[1]{\left[#1\right]}
\newcommand{\hess}{\mathbf{H}}
\newcommand{\jacob}{\mathbf{J}}
\newcommand{\hl}{HL}
\newcommand{\cost}{\mathcal{L}}
\newcommand{\lout}{\mathbf{r}}
\newcommand{\louti}{r}
\newcommand{\outi}{y}
\newcommand{\out}{\mathbf{y}}
\newcommand{\gauss}{\mathbf{G_N}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\softmax}{\text{softmax}}
\newcommand{\targ}{\mathbf{t}}
\newcommand{\metric}{\mathbf{G}}
\newcommand{\sample}{\mathbf{z}}
\newcommand{\f}{\text{f}}
%\newcommand{\log}{\text{log}}
\newcommand{\bmx}[0]{\begin{bmatrix}}
\newcommand{\emx}[0]{\end{bmatrix}}
\newcommand{\qexp}[1]{\left<#1\right>}
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand{\vects}[1]{\boldsymbol{#1}}
\newcommand{\matr}[1]{\mathbf{#1}}
\newcommand{\var}[0]{\operatorname{Var}}
\newcommand{\std}[0]{\operatorname{std}}
\newcommand{\cov}[0]{\operatorname{Cov}}
\newcommand{\diag}[0]{\operatorname{diag}}
\newcommand{\matrs}[1]{\boldsymbol{#1}}
\newcommand{\va}[0]{\vect{a}}
\newcommand{\vb}[0]{\vect{b}}
\newcommand{\vc}[0]{\vect{c}}
\newcommand{\ve}[0]{\vect{e}}
\newcommand{\vh}[0]{\vect{h}}
\newcommand{\vv}[0]{\vect{v}}
\newcommand{\vx}[0]{\vect{x}}
\newcommand{\vp}[0]{\vect{p}}
\newcommand{\vz}[0]{\vect{z}}
\newcommand{\vw}[0]{\vect{w}}
\newcommand{\vs}[0]{\vect{s}}
\newcommand{\vf}[0]{\vect{f}}
\newcommand{\vi}[0]{\vect{i}}
\newcommand{\vo}[0]{\vect{o}}
\newcommand{\vd}[0]{\vect{d}}
\newcommand{\vy}[0]{\vect{y}}
\newcommand{\vg}[0]{\vect{g}}
\newcommand{\vm}[0]{\vect{m}}
\newcommand{\vu}[0]{\vect{u}}
\newcommand{\vL}[0]{\vect{L}}
\newcommand{\vr}[0]{\vect{r}}
\newcommand{\vone}[0]{\vect{1}}
\newcommand{\mW}[0]{\matr{W}}
\newcommand{\mE}[0]{\matr{E}}
\newcommand{\mG}[0]{\matr{G}}
\newcommand{\mX}[0]{\matr{X}}
\newcommand{\mY}[0]{\matr{Y}}
\newcommand{\mQ}[0]{\matr{Q}}
\newcommand{\mU}[0]{\matr{U}}
\newcommand{\mF}[0]{\matr{F}}
\newcommand{\mV}[0]{\matr{V}}
\newcommand{\mA}{\matr{A}}
\newcommand{\mC}{\matr{C}}
\newcommand{\mD}{\matr{D}}
\newcommand{\mL}[0]{\matr{L}}
\newcommand{\mR}[0]{\matr{R}}
\newcommand{\mS}{\matr{S}}
\newcommand{\mI}{\matr{I}}
\newcommand{\td}[0]{\text{d}}
\newcommand{\TT}[0]{\vects{\theta}}
\newcommand{\vsig}[0]{\vects{\sigma}}
\newcommand{\valpha}[0]{\vects{\alpha}}
\newcommand{\vmu}[0]{\vects{\mu}}
\newcommand{\vzero}[0]{\vect{0}}
\newcommand{\tf}[0]{\text{m}}
\newcommand{\tdf}[0]{\text{dm}}
\newcommand{\grad}[0]{\nabla}
\newcommand{\alert}[1]{\textcolor{red}{#1}}
\newcommand{\N}[0]{\mathcal{N}}
\newcommand{\YY}[0]{\mathcal{Y}}
\newcommand{\BB}[0]{\mathcal{B}}
\newcommand{\LL}[0]{\mathcal{L}}
\newcommand{\HH}[0]{\mathcal{H}}
\newcommand{\RR}[0]{\mathbb{R}}
\newcommand{\MM}[0]{\mathcal{M}}
\newcommand{\OO}[0]{\mathbb{O}}
\newcommand{\II}[0]{\mathbb{I}}
\newcommand{\Scal}[0]{\mathcal{S}}
\newcommand{\sigmoid}{\sigma}
\newcommand{\E}[0]{\mathbb{E}}
\newcommand{\enabla}[0]{\ensuremath{%
\overset{\raisebox{-0.3ex}[0.5ex][0ex]{%
\ensuremath{\scriptscriptstyle e}}}{\nabla}}}
\newcommand{\enhnabla}[0]{\nabla_{\hspace{-0.5mm}e}\,}
\newcommand{\eos}[0]{\ensuremath{\left< \text{eos}\right>}}
\newcommand{\todo}[1]{{\Large\textcolor{red}{#1}}}
\newcommand{\done}[1]{{\Large\textcolor{green}{#1}}}
\newcommand{\dd}[1]{\ensuremath{\mbox{d}#1}}
\DeclareMathOperator*{\argmax}{\arg \max}
\DeclareMathOperator*{\argmin}{\arg \min}
\newcommand{\newln}{\\&\quad\quad{}}
\newcommand{\BP}{\text{BP}}
\newcommand{\PPL}{\text{PPL}}
\newcommand{\PL}{\text{PL}}
\newcommand{\MatSum}{\text{MatSum}}
\newcommand{\MatMul}{\text{MatMul}}
\newcommand{\KL}{\text{KL}}
\newcommand{\data}{\text{data}}
\newcommand{\rect}{\text{rect}}
\newcommand{\maxout}{\text{maxout}}
\newcommand{\train}{\text{train}}
\newcommand{\val}{\text{val}}
\newcommand{\init}{\text{init}}
\newcommand{\fenc}{\text{fenc}}
\newcommand{\renc}{\text{renc}}
\newcommand{\enc}{\text{enc}}
\newcommand{\dec}{\text{dec}}
\newcommand{\test}{\text{test}}
\newcommand{\Ax}{\mathcal{A}_x}
\newcommand{\Ay}{\mathcal{A}_y}
\newcommand{\ola}{\overleftarrow}
\newcommand{\ora}{\overrightarrow}
\newcommand{\ov}{\overline}
\newcommand{\ts}{\rule{0pt}{2.6ex}} % Top strut
\newcommand{\ms}{\rule{0pt}{0ex}} % Middle strut
\newcommand{\bs}{\rule[-1.2ex]{0pt}{0pt}} % Bottom strut
\newcommand{\specialcell}[2][c]{%
\begin{tabular}[#1]{@{}c@{}}#2\end{tabular}}
%\usepackage{bibentry}
%\nobibliography*
\begin{document}
\title{Natural Language Understanding with Distributed Representation}
\author{Kyunghyun Cho}
\affil{
Courant Institute of Mathematical Sciences and \\
Center for Data Science,\\
New York University
}
\maketitle
\pagenumbering{arabic}
\abstract{
This is a lecture note for the course DS-GA 3001 $\left<\right.$Natural
Language Understanding with Distributed Representation$\left.\right>$ at the
Center for Data Science\footnote{
\url{http://cds.nyu.edu/}
}, New York University in Fall 2015. As the name of the course suggests,
this lecture note introduces readers to a neural network based approach to
natural language understanding/processing. In order to make it as
self-contained as possible, I spend much time describing the basics of
machine learning and neural networks, and only then introduce how they are
used for natural languages. On the language front, I focus almost solely
on language modelling and machine translation, both of which I personally
find the most fascinating and the most fundamental to natural language
understanding.
After about a month of lectures and about 40 pages of writing this lecture
note, I found this fascinating note \cite{goldberg2015primer} by Yoav
Goldberg on neural network models for natural language processing. His note
covers a wider range of topics on natural language processing with
distributed representations in more detail, and I highly recommend that you
read it (hopefully along with this lecture note.) I seriously wish Yoav had
written it earlier so that I could've simply used his excellent note for my
course.
This lecture note was written quite hastily as the course progressed,
meaning that I could spare only about 100 hours in total for it.
This is my lame excuse for the many mistakes that likely remain, and I
kindly ask for your understanding in advance. Again, how grateful I
would've been had I found Yoav's note earlier.
I am planning to update this lecture note gradually over time, hoping that I
will be able to convince the Center for Data Science to let me teach the
same course next year. The latest version will always be available, both as a
PDF and as LaTeX source, from
\url{https://github.com/nyu-dl/NLP_DL_Lecture_Note}. The arXiv version will
be updated whenever a major revision is made.
I thank all the students and non-students who took\footnote{
In fact, they are still taking the course as of 24 Nov 2015. They have
two guest lectures and a final exam left until the end of the course.
}
this course and David Rosenberg for feedback.
}
\tableofcontents
\chapter{Introduction}
\label{chap:intro}
This lecture is going to be the only one in which I discuss some philosophical,
that is nonpractical, arguments, because according to Chris Manning and Hinrich
Schuetze, ``{\it even practically-minded people have to confront the issue of
what prior knowledge to try to build into their model}''
\cite{manning1999foundations}.
\section{Route we will {\it not} take}
\label{sec:wrong_route}
\subsection{What is Language?}
The very first question we must ask ourselves before starting this course is
what natural language is. Of course, the rest of this course does
not in any way require us to know what natural language is, but it is a
philosophical question I recommend everyone, including myself, ponder
once in a while.
When I start talking about languages with anyone, there is one person who
never fails to be mentioned: Noam Chomsky. His view has greatly
influenced modern linguistics, and although many linguists I have talked to
claim that their work and field have long moved on from Chomsky's, I can feel
his shadow all over them.
My first encounter with Chomsky was in an $<$Automata$>$ course in my
early undergraduate years. I was not the most attentive student back then, and all I
can remember is Chomsky's hierarchy and how it has shaped our view on languages,
in this context, programming/computer languages. A large part of the course was
dedicated to explaining which class of languages emerges given a set of
constraints on a set of {\it generating rules}, or production rules.
For instance, if we are given a set of generating rules that do not depend on
the context/meaning of non-terminal symbols (context-free grammar, CFG), we get
a context-free language. If we further constrain the CFG so that each
generating rule replaces a non-terminal symbol with either a
terminal symbol, a terminal symbol followed by a non-terminal symbol, or an empty
symbol, then we get a regular grammar. Similarly to the CFG, we get a regular language from
the regular grammar, and the regular language is a subset of the context-free
language.
What Chomsky believes is that this kind of approach applies also to human
languages, or natural languages. There exists a set of generating rules that
{\it generates} a natural language. But, then, the obvious question to follow is
where those generating rules are. Where are they stored? How are they stored? Do
we have separate generating rules for different languages?
\subsection{Language Understanding}
\label{sec:language_understanding_wrong}
\paragraph{Understanding Human Language}
Those questions are interesting, but out of scope for this course. Those
questions are the ones linguists try to answer. Generative linguistics aims at
figuring out what those rules are, how they are combined to form a valid
sentence, how they are adapted to different languages and so on. We will leave
these to linguists and continue on to our journey of {\it building a machine
that understands human languages}.
\paragraph{Natural Language Understanding}
So, let's put these questions aside and trust Chomsky that we, humans, are
specially designed to store those generating rules somewhere in the brain
\cite{chomsky1959review,carnie2013syntax}. Or, better yet, let's trust Chomsky
that there's a universal grammar {\it built in} our brain. In other words, let's
say we were born with this set of generating rules for natural languages, and
while growing, we have adapted this universal grammar toward our native tongue
(language variation).
When we decide to speak of something (whatever that is and however implausible
that is), our brain quickly picks up a sequence of some of those generating
rules and starts generating a sentence accordingly. Of course, those rules do
not generate a sentence directly, but rather generate a sequence of control signals to
move our muscles to make sound. When heard by other people who understand your
language, the sound becomes a sentence.
In our case, we are more interested in a {\it machine} hearing that sound, or a
sentence from here on. When a machine hears this sentence, what would/should a
{\it language understanding machine} do to understand a language, or more simply
a sentence? Again, we are assuming that this sentence was generated by
applying a sequence of the existing generating rules.
Under our assumption, a natural first step that comes to my mind is to figure
out that sequence of the generating rules which led to the sentence. Once the
sequence is found, or in a fancier term, inferred, the next step will be to
figure out what kind of mental state of the speaker led to those generating
rules.
Let's take an example sentence ``{\it Our company is training workers}'' (from
Sec. 1.3 of \cite{manning1999foundations}), which is a horrible choice, because
this was used as an example of ambiguity in parsing. Regardless, a speaker
obviously has an awesome image of her company which trains its workers and wants
to tell a machine about this. This mental state is used to select the following
generating rules (assuming a phrase structure grammar)\footnote{
Stanford Parser: \url{http://nlp.stanford.edu:8080/parser}
}:
\begin{verbatim}
(ROOT
(S
(NP (PRP$ Our) (NN company))
(VP (VBZ is)
(VP (VBG training)
(NP (NNS workers))))))
\end{verbatim}
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\textwidth]{figures/manning_parse.pdf}
\caption{A parse of ``{\it Our company is training workers}''}
\label{fig:manning_parse}
\end{figure}
The machine hears the sentence ``{\it Our company is training workers}'' and
infers the parse in Fig.~\ref{fig:manning_parse}. Then, we can make a simple set
of rules (again!) to let the machine answer questions about this sentence, kinds
of questions that imply that the machine has understood the sentence (language).
For instance, given the question ``{\it Who is training workers?}'', the machine
can answer by noticing that the question asks for the subject of the verb
phrase ``{\it is training}'' acting on the object ``{\it workers}'', and that this subject
is ``{\it Our company}''.
\paragraph{Side Note: Bayesian Language Understanding} This generative view of
languages fits quite well with Bayesian modelling (see, e.g.,
\cite{perfors2006poverty}.) There exists a hidden mechanism, or a set of
generating rules and a rule governing their composition, which can be modelled
as a latent variable $Z$. Given these rules, a language or a sentence $X$ is
generated according to the conditional distribution $P(X|Z)$. Then,
understanding language (by humans) is equivalent to computing the posterior
distribution over all possible sets of generating rules and their compositional
rules (i.e., $P(Z|X)$.) This answers the question of what is the most likely
mechanism underlying the observed language.
Furthermore, from the perspective of machines, the Bayesian approach is attractive.
In this case, we assume that we know {\it the} set of rules in advance and let the
latent variable $Z$ denote the specific configuration (use) of those rules.
Given this sequence of applying the rules, a sentence $X$ is generated via the
conditional distribution $P(X|Z)$. Machine understanding of language is
equivalent to inferring the posterior distribution over $Z$ given $X$.
For more details about Bayesian approaches (in the context of machine learning),
please, refer to \cite{bishop2006pattern} or take the course DS-GA 1005
Inference and Representation by Prof. David Sontag.
\paragraph{Understanding vs. Using}
What's clear from this example is that in this generative view of languages,
there is a clear separation between understanding and using. Inferring the
generating rules from a given sentence is {\it understanding}, and answering a
question based on this understanding, {\it using}, is a separate activity.
The understanding part is done once the underlying (true) structure has been
determined, regardless of how this understanding will be used.
To put it in a slightly different wording, language understanding does not
require its use, or downstream tasks. In this road that we will {\em not} take
in this course, understanding exists as it is, regardless of what the understood
insight/knowledge will be used for. And, this is the reason why we do not walk
down this road.
\section{Road we {\it will} take}
\label{sec:intro}
\subsection{Language as a Function}
In this course, we will view a natural/human language as ``{\it a system
intended to communicate ideas from a speaker to a hearer}''
\cite{winograd1972understanding}. What this means is that we do not view a
language as a separate entity that exists on its own. Rather, we view a whole
system or behaviour of {\it communication} as a language. Furthermore, this view
dictates that we must take into account the world surrounding a speaker and a
hearer in order to understand language.
Under this view of language, language, or rather its usage, becomes somewhat
similar to action or behaviour. Speaking of something is equivalent to acting on
a listener, as both of them influence the listener in one way or another. The
purpose of language is then to influence another by efficiently communicating
one's will or intention.\footnote{
Chomsky does not agree: ``{\it it is wrong to think of human use of language
as characteristically informative, in fact or in intention.}''
\cite{chomsky1968linguistic}.
} This hints at how language came to be (or may have come to be): (evolution)
language has evolved to facilitate the exchange of ideas among people; (learning)
humans learn language by being either encouraged or punished for their use of
language. This latter view on how language came to be is similar in spirit to
the behaviourism of B. F. Skinner (``{\it necessary mediation of reinforcement by
another organism}'' \cite{skinner2014verbal}.)
This is a radical departure from the generative view of human language, where
language existed on its own and its understanding does not necessarily require
the existence of the outside world nor the existence of a listener. It is no
wonder why Chomsky was so harsh in criticizing Skinner's work in
\cite{chomsky1959review}.
This departure, as I see it, is the departure toward a functional view of
language. {\it Language is a function of communication}.
\subsection{Language Understanding as a Function Approximation}
Let's make a large jump here such that we consider this function as a
mathematical function. This function (called language) takes as input the state
of the surrounding world, the speaker's speech (either written, spoken or signed),
and the listener's mental state.\footnote{
We assume here that such a thing exists, however it is represented in
our brain.
} Inside the function, the listener's mental state is updated to incorporate the
new idea from the speaker's speech. The function then returns a response by the
listener (which may include ``no response'' as well) and a set of non-verbal
action sequences (what would be the action sequence if the speaker insulted
the listener?).
In this case, language understanding, both from humans' and machines'
perspective, boils down to figuring out the internal working of this function.
In other words, we understand language by learning the internal mechanism of the
function. Furthermore, this view suggests that the underlying structures of
language are heavily dependent on the surrounding environment (context) as well
as on the target task. The former (context dependence) is quite clear, as the
function takes as input the context, but the latter may be confusing now.
Hopefully, this will become clearer later in the course.
How can we approximate this function? How can we figure out the internal working
mechanism of this function? What tools do we have?
\paragraph{Language Understanding by Machine Learning}
This functional view of languages suddenly makes machine learning a very
appealing tool for understanding human languages. After all, function
approximation is {\em the} core of machine learning. Classification is a
classical example of function approximation, clustering is a function
approximation where the target is not given, generative modeling learns a
function that returns a probability of an input, and so on.
When we approximate a function in machine learning, the prime ingredient is
data. We are given data which was either generated from this function
(unsupervised learning) or which fits this function well (supervised learning), based
on which we adjust our approximation to the function, often iteratively, to best
fit the data. But, I must note here that what matters is not how well the approximated
function fits the data it was fitted to, but how well this approximation fits
{\em unseen} data.\footnote{
This is a matter of generalization, and we will talk
about this more throughout the course.
}
In language understanding, this means that we collect a large data set of input
and output pairs (or conversations together with the recording of the
surrounding environment) and fit some arbitrary function to well predict the
output given an input. We probably want to evaluate this approximation in a
novel conversation. If this function makes a conversation just like a person,
voil\`{a}, we made a machine that passed the Turing test. Simple, right?
\paragraph{Problem}
Unfortunately, as soon as we try to do this, we run into a big problem. This
problem is not from machine learning nor languages, but the definition of this
function of language.
Properly approximating this function requires us to either simulate or record
the whole world (in fact, the whole universe), because this function takes as input
and maintains as internal state the surrounding world (context) and the mental
state of the individual (speaker.) This is unavoidable if we want to approximate
this function well as a whole.
It is unclear, however, whether we want to approximate the full function. For a
human to survive, yes, it is likely that the full function is needed. But, if
our goal is restricted to a certain task (such as translation, language
modelling, and so on), we may not want to approximate this function fully. We
probably want to approximate only a subset of this whole function. For instance,
if our goal is to understand the process of translation from one language to
another, we can perhaps ignore all but the speech input to the function and all
but the speech output from the function, because often a (trained) person can
translate a sentence in one language to another without knowing the whole
context.
This latter approach to language understanding (approximating a partial function
of language) will be at the core of this course. We will talk about various
language tasks that are a part of this whole function of language. These tasks
will include, but are not limited to, language modelling, machine translation,
image/video description generation and question answering. For these tasks and
potentially more, we will study how to use machine learning, or more
specifically deep learning, to solve these tasks by approximating sub-functions
of language.
\chapter{Function Approximation as Supervised Learning}
\label{chap:function_approx}
Throughout this course, we will extensively use artificial neural
networks\footnote{
From here on, I will simply drop artificial and call them neural networks.
Whenever I say ``neural network'', it refers to artificial neural networks.
}
to approximate (a part of) the function of natural language. This makes it
necessary for us to study the basics of neural networks first, and this lecture
and a couple of subsequent ones are designed to serve this purpose.
\section{Function Approximation: Parametric Approach}
\subsection{Expected Cost Function}
Let us start by defining a data distribution $p_{\text{data}}$.
$p_{\text{data}}$ is defined over a pair of input and output vectors, $\vx \in
\II^d$ and $\vy \in \OO^k$, respectively. $\II$ and $\OO$ are respectively sets
of all possible input and output values, such as $\RR$, $\left\{ 0, 1\right\}$
and $\left\{0, 1, \ldots, L\right\}$. This data distribution is not known to us.
The goal is to find a relationship between $\vx$ and $\vy$. More specifically,
we are interested in finding a function $f:\II^d \to \OO^k$ that generates the
output $\vy$ given its corresponding input $\vx$. The very first thing we should
do is to put some constraints on the function $f$ to make our search for the
correct $f$ a bit less impossible. In this lecture, and throughout the course, I
will consider only a parametric function $f$, in which case the function is
fully specified with a set of parameters $\TT$.
Next, we must define a way to measure how well the function $f$ approximates the
underlying mechanism of generation ($\vx \to \vy$). Let's denote by $\hat{\vy}$ the
output of the function with a particular set $\TT$ of parameters and a given input
$\vx$:
\begin{align*}
\hat{\vy} = f_{\TT}(\vx)
\end{align*}
How well $f$ approximates the true generating function is equivalent to how far
$\hat{\vy}$ is from the correct output $\vy$. Let us for now use $D(\hat{\vy}, \vy)$ to
denote this distance\footnote{
Note that we do not require this distance to satisfy the triangle
inequality, meaning that it does not have to be a true distance. However, I will
just call it a distance for now.
}
between $\hat{\vy}$ and $\vy$.
It is clear that we want to find $\TT$ that minimizes $D(\hat{\vy}, \vy)$ for
every pair in the space ($\II^d \times \OO^k$). But, wait, is every pair equally
likely? Probably not, for we do not care how well $f_{\TT}$ approximates the
true function when a pair of input $\vx$ and output $\vy$ is unlikely; that is,
we do not care how bad the approximation is if $p_\text{data}(\vx, \vy)$ is
small. However, this is a bit difficult to take into account, as we must decide
on a threshold below which we consider any pair irrelevant.
Hence, we {\em weight} the distance between the approximated $\hat{\vy}$ and the
correct $\vy$ of each pair $(\vx, \vy)$ in the space by its probability $p(\vx,
\vy)$. Mathematically saying, we want to find
\begin{align*}
\arg\min_{\TT} \int_{\vx} \int_{\vy} p_{\text{data}}(\vx, \vy) D(\hat{\vy}, \vy) \dd{\vx}
\dd{\vy},
\end{align*}
where the integral $\int$ should be replaced with the summation $\sum$ if any of
$\vx$ and $\vy$ is discrete.
We call this quantity being minimized with respect to the parameters $\TT$ a
cost function $C(\TT)$. This is equivalent to computing the {\em expected}
distance between the predicted output $\hat{\vy}$ and the correct one $\vy$:
\begin{align}
\label{eq:expected_cost}
C(\TT) =& \int_{\vx} \int_{\vy} p_{\text{data}}(\vx, \vy) D(\hat{\vy}, \vy) \dd{\vx}
\dd{\vy}, \\
=& \E_{(\vx,\vy) \sim p_{\text{data}}}\left[ D(\hat{\vy}, \vy) \right]
\end{align}
This is often called an expected loss or risk, and minimizing this cost function
is referred to as {\em expected risk minimization}~\cite{Vapnik1995}.
Unfortunately $C(\TT)$ cannot be (exactly) computed for a number of reasons. The
most important reason among them is simply that we don't know what the data
distribution $p_{\text{data}}$ is. Even if we have access to $p_{\text{data}}$,
we can exactly compute $C(\TT)$ only with heavy assumptions on both the data
distribution and the distance function.\footnote{Why?}
\subsection{Empirical Cost Function}
This does not mean that we are doomed from the beginning. Instead of the
full-blown description of the data distribution $p_{\text{data}}$, we will
assume that someone miraculously gave us a finite set of pairs drawn from the
data distribution. We will call this a training set:
\begin{align*}
\left\{ (\vx^1, \vy^1), \ldots, (\vx^N, \vy^N) \right\}.
\end{align*}
As we have access to samples from the data distribution, we can use the Monte
Carlo method to approximate the expected cost function $C(\TT)$ such that
\begin{align}
\label{eq:empirical_cost}
C(\TT) \approx \tilde{C}(\TT) = \frac{1}{N} \sum_{n=1}^N D(\hat{\vy^{n}},
\vy^{n}).
\end{align}
We call this approximation $\tilde{C}(\TT)$ to the expected cost function the
empirical cost function (or empirical risk or empirical loss.)
Because the empirical cost function is readily computable, we will mainly work with
the empirical cost function rather than the expected cost function. However, keep
in mind that at the end of the day, the goal is to find a set of parameters that
minimizes the {\em expected} cost.
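To make the Monte Carlo approximation above concrete, here is a minimal sketch in
Python/NumPy of how one might compute the empirical cost on a training set. The
names (\texttt{f}, \texttt{dist}, \texttt{train\_x}, \texttt{train\_y}) are
hypothetical placeholders for the parametric function $f_{\TT}$, the distance $D$
and the training pairs; they are not defined elsewhere in this note.
\begin{verbatim}
import numpy as np

def empirical_cost(f, dist, train_x, train_y):
    # Monte Carlo estimate of the expected cost: average the
    # per-example distance D(f(x^n), y^n) over the N training pairs.
    return np.mean([dist(f(x), y) for x, y in zip(train_x, train_y)])

# example distance: squared Euclidean distance
sq_dist = lambda y_hat, y: 0.5 * np.sum((y_hat - y) ** 2)
\end{verbatim}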
\section{Learning as Optimization}
We often call this process of finding a good set of parameters that minimizes
the expected cost {\em learning}. This term is used from the perspective of a
machine which implements the function $f_{\TT}$, as it {\em learns} to
approximate the true generating function $f$ from training data.
From what I have described so far, it may have become clear even without me
mentioning that learning is {\em optimization}. We have a clearly defined
function (the empirical cost function $\tilde{C}$) which needs to be minimized
with respect to its input $\TT$.
\subsection{Gradient-based Local Iterative Optimization}
There are many optimization algorithms one can use to find a set of parameters
that minimizes $\tilde{C}$. Sometimes, you can even find the optimal set of
parameters in a closed form equation.\footnote{
One such example is a linear regression where
\begin{itemize}
\item $f_{\TT=\left\{ \mW\right\}}(\vx) = \mW \vx$
\item $D(\hat{\vy}, \vy) = \frac{1}{2} \| \hat{\vy} - \vy \|^2$
\end{itemize}
In this case, the optimal $\mW$ is
\begin{align}
\label{eq:opt_lin}
\mW = \mY \mX^\top (\mX \mX^\top)^{-1},
\end{align}
where
\begin{align*}
\mX = \left[ \vx^1; \ldots ;\vx^N\right], \mY = \left[ \vy^1; \ldots ;\vy^N\right].
\end{align*}
Try it yourself!
}
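As a quick illustration, the closed-form solution in Eq.~\eqref{eq:opt_lin} can be
checked numerically in a few lines of Python/NumPy. This is only a sketch, assuming
the columns of $\mX$ and $\mY$ are the training examples.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
d, k, N = 3, 2, 100

X = rng.randn(d, N)                      # inputs x^1, ..., x^N as columns
W_true = rng.randn(k, d)
Y = W_true @ X + 0.01 * rng.randn(k, N)  # slightly noisy targets

# closed-form solution W = Y X^T (X X^T)^{-1}
W = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.allclose(W, W_true, atol=0.05))  # recovers W_true approximately
\end{verbatim}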
In most cases, because there is no known closed-form solution, it is typical to
use an iterative optimization algorithm (see \cite{Fletcher1987} for in-depth
discussion on optimization.)
By an {\em iterative} optimization, I mean an algorithm which refines its
estimate of the optimal set of parameters little by little until the
parameters converge to a (local) minimum of the cost function. Also, it is
worthwhile to note that most iterative optimization algorithms are {\em local},
in the sense that they do not require us to evaluate the whole parameter space,
but only a small subset along the path from the starting point to the
convergence point.\footnote{
There are {\em global} optimization algorithms, but they are out of scope
for this course. See, for instance, \cite{Brochu2010} for one such algorithm
called Bayesian optimization.
}
Here I will describe the simplest one among those local iterative optimization
algorithms, called the gradient descent (GD) algorithm. As the name suggests, this
algorithm depends entirely on the gradient of the cost function.\footnote{
From here on, I will use the cost function to refer to the {\em empirical}
cost function.
}
The gradient of a function, $\nabla \tilde{C}$, is a vector that points
in the direction of the greatest rate of increase in the function's value and
whose magnitude measures this rate. At each point $\TT_t$ in the parameter
space, the gradient of the cost function $\nabla \tilde{C}(\TT_t)$ points in the
direction {\em opposite} to the one toward which we want to move the parameters. See
Fig.~\ref{fig:grad} for graphical illustration.
\begin{figure}
\centering
\begin{minipage}{0.6\textwidth}
\includegraphics[width=\columnwidth]{figures/grad_figure.png}
\end{minipage}
\hfill
\begin{minipage}{0.39\textwidth}
\caption{(\textcolor{blue}{blue}) $f(x) = \sin(10 x) + x$.
(\textcolor{red}{red}) a gradient at $x=-0.6$.
(\textcolor{magenta}{magenta}) a negative gradient at $x=-0.6$.}
\label{fig:grad}
\end{minipage}
\end{figure}
One important point of GD that needs to be mentioned here is how large a step
one takes each time. As is clear from the magenta line (the direction opposite to
the direction given by the gradient) in Fig.~\ref{fig:grad}, if too large a step
is taken in the negative gradient direction, the optimization process will
overshoot and miss the (local) minimum around $x=-0.8$. This step size $\eta$,
sometimes called the learning rate, is one of the most important hyperparameters of
the GD algorithm.
Now we have all the ingredients for the GD algorithm: $\nabla \tilde{C}$ and
$\eta$. The GD algorithm iterates the following step:
\begin{align}
\label{eq:GD}
\TT \leftarrow \TT - \eta \nabla \tilde{C}(\TT).
\end{align}
The iteration continues until a certain stopping criterion is met, which we will
discuss shortly.
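As a concrete illustration, the update rule in Eq.~\eqref{eq:GD} can be written as a
short loop. This is only a minimal sketch; the gradient function and the quadratic
example below are hypothetical stand-ins for $\nabla \tilde{C}$ and $\tilde{C}$.
\begin{verbatim}
import numpy as np

def gradient_descent(grad_C, theta0, eta=0.1, n_steps=100):
    # repeat theta <- theta - eta * grad_C(theta)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad_C(theta)
    return theta

# example: minimize C(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta_star = gradient_descent(lambda th: th, [1.0, -2.0])
\end{verbatim}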
\subsection{Stochastic Gradient Descent}
\label{sec:sgd}
This simple GD algorithm works surprisingly well, and it is a fundamental
basis upon which many advanced optimization algorithms have been built. I will
present a list of a few of those advanced algorithms later on and discuss them
briefly. But, before going into those advanced algorithms, let's solve one tiny,
but significant, issue of the GD algorithm.
This tiny, but significant, issue arises especially often in machine learning.
That is, it is computationally very expensive to compute $\tilde{C}$ and
consequently its gradient $\nabla \tilde{C}$, thanks to the ever-increasing size
of the training set $D$.
Why is the growing size of the training set making it more and more
computationally demanding to compute $\tilde{C}$ and $\nabla \tilde{C}$? This
is because both of them are essentially the sum of as many per-sample costs as
there are examples in the training set. In other words,
\begin{align*}
&\tilde{C}(\TT) = \frac{1}{N} \sum_{n=1}^N \tilde{C}(\vx^n, \vy^n | \TT), \\
&\nabla \tilde{C}(\TT) = \frac{1}{N} \sum_{n=1}^N \nabla \tilde{C}(\vx^n, \vy^n | \TT).
\end{align*}
And, $N$ goes up to millions or billions very easily these days.
This enormous computational cost involved in each GD step has motivated the {\em
stochastic gradient descent} (SGD) algorithm~\cite{Robbins1951,Bottou1998}.
First, recall from Eq.~\eqref{eq:empirical_cost} that the cost function we
minimize is the {\em empirical} cost function $\tilde{C}$ which is the
sample-based approximation to the {\em expected} cost function $C$. This
approximation was done by assuming that the training examples were drawn
randomly from the data distribution $p_{\text{data}}$:
\begin{align*}
C(\TT) \approx \tilde{C}(\TT) = \frac{1}{N} \sum_{n=1}^N D(\hat{\vy^{n}},
\vy^{n}).
\end{align*}
In fact, as long as this assumption on the training set holds, we can always
approximate the expected cost function with a smaller number of training examples:
\begin{align*}
C(\TT) \approx \tilde{C}_\MM (\TT) = \frac{1}{|\MM|} \sum_{m \in \MM}
D(\hat{\vy^{m}}, \vy^m),
\end{align*}
where $M = |\MM| \ll N$ and $\MM$ is the set of indices of the examples in this much smaller
subset of the training set. We call this small subset a {\em minibatch}.
Similarly, this leads to a minibatch-based estimate of the gradient as well:
\begin{align*}
\nabla \tilde{C}_\MM (\TT) = \frac{1}{|\MM|} \sum_{m \in \MM} \nabla
D(\hat{\vy^m}, \vy^m).
\end{align*}
It must now be clear to you where I am headed. At each GD step, instead
of using the full training set, we will use a small, randomly selected subset $\MM$
to compute the gradient estimate. In other words, we use
$\tilde{C}_\MM$ instead of $\tilde{C}$, and $\nabla \tilde{C}_\MM$ instead of
$\nabla \tilde{C}$, in Eq.~\eqref{eq:GD}.
Because the cost of computing $\tilde{C}_\MM$ and $\nabla \tilde{C}_\MM$ is independent of
the size of the training set, we can use SGD to make as many steps as we want
without worrying about the growing number of training examples. This is highly
beneficial, because no matter how many training examples were used to compute the
gradient, we can only take a tiny step in that descending direction anyway.
Furthermore, the increased level of noise in the gradient estimate due to the
small sample size has been suspected to help in reaching a better solution in
high-dimensional non-convex problems (such as those encountered when training deep neural
networks) \cite{Lecun1998a}.\footnote{
Why would this be the case? It is worth thinking about this issue further.
}
We can set $M$ to be any constant, and in an extreme, we can set it to $1$ as
well. In this case, we call it online SGD.\footnote{
Okay, this is not true in a strict sense. SGD is an online algorithm with
$M=1$ originally, and using $M>1$, is a variant of SGD, often called,
minibatch SGD. However, as using minibatches ($M>1$) is almost always the
case in practice, I will refer to minibatch SGD as SGD, and to the original
SGD as online SGD.
} Surprisingly, already in 1951, it was shown that using a single example each
time is enough for the SGD to converge to a minimum (under certain conditions,
obviously) \cite{Robbins1951}.
This SGD algorithm will be at the core of this course and will be discussed
further in the future lectures.
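To summarize the difference from plain GD in code, here is a minimal sketch of
minibatch SGD, assuming a hypothetical per-example gradient function and a training
set stored in arrays \texttt{X} and \texttt{Y} with one example per row.
\begin{verbatim}
import numpy as np

def sgd(grad_example, theta0, X, Y, eta=0.01, M=32, n_steps=1000, seed=0):
    # each step uses a random minibatch of M examples instead of all N
    rng = np.random.RandomState(seed)
    theta, N = np.array(theta0, dtype=float), X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(N, size=M, replace=False)
        g = np.mean([grad_example(theta, X[i], Y[i]) for i in idx], axis=0)
        theta = theta - eta * g
    return theta
\end{verbatim}
Setting \texttt{M=1} recovers the original online SGD, while \texttt{M=N} recovers
plain GD on the full training set.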
\section{When do we stop learning?}
\label{sec:model_selection}
From here on, I assume that we approximate the ground truth function by
iteratively refining its set of parameters, in most cases using {\em stochastic
gradient descent}. In other words, learning of a machine that approximates the
true generating function $f$ happens gradually as the machine goes over the
training examples little by little over time.
Let us first go over again what kinds of constraints/issues we have:
\begin{enumerate}
\item Lack of access to the expected cost function $C(\TT)$
\item Computationally expensive empirical cost function $\tilde{C}(\TT)$
\item (Potential) non-convexity of the empirical cost function
$\tilde{C}(\TT)$
\end{enumerate}
The most severe issue is that we do not have access to the expected cost
function which is the one we want to minimize in order to work well with {\em
any} pair of input $\vx$ and output $\vy$. Instead, we have access to the
empirical cost function which is a finite sample approximation to the expected
cost function.
\begin{figure}
\centering
\begin{minipage}{0.6\textwidth}
\includegraphics[width=\columnwidth]{figures/cost_figure.png}
\end{minipage}
\hfill
\begin{minipage}{0.39\textwidth}
\caption{(\textcolor{blue}{blue}) Expected cost function $C(\TT)$.
(\textcolor{red}{red}) Empirical cost function $\tilde{C}(\TT)$.
The underlying true generating function was $f(x) = \sin(10x) + x$.
The cost function uses the squared Euclidean distance.
The empirical cost function was computed based on 10 noisy
examples of which $x$'s were sampled from the uniform distribution
between $0$ and $1$. For each sample input $x$, noise from zero-mean Gaussian
distribution with standard deviation $0.01$ was added to $f(x)$ to emulate
the noisy measurement channel.}
\label{fig:cost}
\end{minipage}
\end{figure}
Why is this a problem? Because, we do not have a guarantee that the (local)
minimum of the empirical cost function corresponds to the (local) minimum of the
expected cost function. An example of this mismatch between the expected and
empirical cost functions is shown in Fig.~\ref{fig:cost}.
As in the case shown in Fig.~\ref{fig:cost}, it is not desirable to minimize the
empirical cost function perfectly. The parameters that perfectly minimize the
empirical cost function (in the case of
Fig.~\ref{fig:cost}, the slope $a$ of a linear function $f(x) = a x$) will
likely yield a sub-optimal value of the expected cost function, which is the one we
really care about.
\subsection{Early Stopping}
What should we do? There are many ways to resolve this weird contradiction where
we want to optimize the cost function well, but not too well. Among those, one of the
most important tricks is {\em early stopping}, which is only applicable when
iterative optimization is used.
First, we will split the training set $D$ into two partitions $D_{\train}$ and
$D_{\val}$.\footnote{
Later on, we will split it further into three partitions.
}
We call them a training set and a validation set, respectively. In practice it
is a good idea to keep $D_{\train}$ much larger than $D_{\val}$, for reasons that
will become clear shortly.
Further, let us define the training cost as
\begin{align}
\label{eq:train_c}
\tilde{C}(\TT) = C_{\train}(\TT) = \frac{1}{|D_{\train}|} \sum_{(x,y) \in
D_{\train}} D(\hat{y}, y),
\end{align}
and the validation cost as
\begin{align}
\label{eq:val_c}
C_{\val}(\TT) = \frac{1}{|D_{\val}|} \sum_{(x,y) \in D_{\val}} D(\hat{y}, y).
\end{align}
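In practice, the split of $D$ into $D_{\train}$ and $D_{\val}$ is often done by
randomly partitioning the training pairs. The following is a minimal sketch of such
a split; the fraction reserved for validation is an arbitrary choice here.
\begin{verbatim}
import numpy as np

def split_train_val(X, Y, val_fraction=0.1, seed=0):
    # randomly partition the N pairs into D_train and D_val
    rng = np.random.RandomState(seed)
    perm = rng.permutation(X.shape[0])
    n_val = int(val_fraction * X.shape[0])
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return (X[train_idx], Y[train_idx]), (X[val_idx], Y[val_idx])
\end{verbatim}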
With these two cost functions we are now ready to use early stopping.
After every few updates using SGD (or GD), the validation cost function is
evaluated with the current set of parameters. The parameters are updated (i.e.,
the training cost function is optimized) until the validation cost stops
decreasing, or starts to increase instead.
That's it! It is almost free, as long as the size of the validation set is
reasonable, since each evaluation is at most as expensive as computing the
gradient of the empirical cost function. Because of its simplicity and
effectiveness, this early stopping strategy has become a {\em de facto} standard
in deep learning and in machine learning in general.
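The procedure just described can be sketched as the following training loop. The
update function, validation cost and patience threshold are hypothetical
placeholders; only the overall structure is what matters here.
\begin{verbatim}
import numpy as np

def train_with_early_stopping(update, C_val, theta0,
                              eval_every=100, patience=5, max_steps=100000):
    # run SGD updates, checking the validation cost every eval_every steps;
    # stop once it has not improved for `patience` consecutive checks
    theta = np.array(theta0, dtype=float)
    best_theta, best_val, bad_checks = theta.copy(), np.inf, 0
    for step in range(1, max_steps + 1):
        theta = update(theta)              # one SGD step on C_train
        if step % eval_every == 0:
            val = C_val(theta)
            if val < best_val:
                best_val, best_theta, bad_checks = val, theta.copy(), 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break
    return best_theta
\end{verbatim}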
The question that needs to be asked here is what the validation cost function
does here. Clearly, it approximates the expected cost function $C$, similarly to
the empirical cost function $\tilde{C}$ as well as the training cost function
$C_{\train}$. In the infinite limit of the size of either training or validation
set, they should coincide, but in the case of a finite set, those two cost
functions differ by the noise in sampling (sampling pairs from the data
distribution) and observation (noise in $\vy=f(\vx)$.)
The fact that we explicitly optimize the training cost function implies that
there is a possibility (in fact, almost surely in practice) that the set of
parameters found by this optimization process may capture not only the
underlying generating function but also noise in the observation and sampling
procedure. This is an issue, because we want our machine to approximate the true
generating function not the noise process involved.
The validation cost function measures both the true generating structure and
the noise injected during sampling and observation. However, assuming that noise
is not correlated with the underlying generating function, noise introduced in
the validation cost function differs from that in the training cost function. In
other words, the set of parameters that perfectly minimizes the training cost
function (thereby capturing even noise in the training set) will be penalized
when measured by the validation cost function.
\subsection{Model Selection}
\label{sec:hypothesis_space}
In fact, the use of the validation cost does not stop at the early stopping.
Rather, it has a more general role in model selection. First, we must talk about
model selection itself.
This whole procedure of optimization, or learning, can be cast as a process of
searching for the best {\em hypothesis} over the entire space $\HH$ of
hypotheses. Here, each hypothesis corresponds to each possible function (with a
unique set of parameters and a unique functional form) that takes the input
$\vx$ and output $\vy$. In the case of regression ($\vx \in \RR^d$ and $\vy \in
\RR$), the hypothesis space includes an $n$-th order polynomial function
\begin{align*}
f(\vx) = \sum_{\substack{i_1 + i_2 + \cdots + i_d \leq n \\ i_1, \ldots, i_d \geq 0}}
a_{i_1,i_2,\ldots,i_d} \prod_{k=1}^d x_{k}^{i_k},
\end{align*}
where the $a_{i_1,i_2,\ldots,i_d}$'s are the coefficients, and
any other functional form that you can imagine as long as it can process $\vx$
and return a real-valued scalar. In the case of neural networks, this space
includes all the possible model architectures which are defined by the number of
layers, the type of nonlinearities, the number of hidden units in each layer and
so on.
Let us use
$M \in \HH$ to denote one hypothesis.\footnote{
$M$, because each hypothesis corresponds to one learning {\em machine}.
} One important thing to remember is that the parameter space is only a subset
of the hypothesis space, because the parameter space is defined by a family of
hypotheses (the parameter space of a linear function cannot include a set of
parameters for a second-order polynomial function.)
Given a definition of expected cost function, we can {\em score} each hypothesis
$M$ by the corresponding cost $C_M$. Then, the whole goal of function
approximation boils down to the search for a hypothesis $M$ with the minimal
expected cost function $C$. But, of course, we do not have access to the
expected cost function and resort to the empirical cost function based on a
given training set.
The optimization-based approach we discussed so far searches for the best
hypothesis based on the empirical cost iteratively. However, because of the
issue of {\em overfitting}, by which the optimization algorithm overshoots
and misses the (local) minimum of the expected cost function (because it aims
at the (local) minimum of the empirical cost function), I introduced the concept
of early stopping based on the validation cost.
This is unfortunately not satisfactory, as we have only searched for the best
hypothesis inside a small subset of the whole hypothesis space $\HH$. What if
another subset of the hypothesis space includes a function that better suits the
underlying generating function $f$? Are we doomed?
It is clearly better to try more than one subset of the hypothesis space. For
instance, for a regression task, we can try linear functions ($\HH_1$),
quadratic (second-order polynomial) functions ($\HH_2$) and sinusoidal functions
($\HH_3$). Let's say for each of these subsets, we found the best hypothesis
(using iterative optimization and early stopping); $M_{\HH_1}$, $M_{\HH_2}$ and
$M_{\HH_3}$. Then, the question is how we should choose one of those hypotheses.
Similarly to what we've done with early stopping, we can use the validation cost
to compare these hypotheses. Among those three we choose the one that has the
smallest validation cost $C_{\val}(M)$.
This is one way to do {\em model selection}, and we will talk about another way
to do this later.
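In code, this form of model selection amounts to nothing more than picking the
already-trained hypothesis with the smallest validation cost. A minimal sketch,
assuming each candidate has already been fitted (with early stopping) beforehand
and that \texttt{C\_val} evaluates Eq.~\eqref{eq:val_c} for a given hypothesis:
\begin{verbatim}
def select_model(candidates, C_val):
    # candidates: already-trained hypotheses, e.g. the best linear,
    # quadratic and sinusoidal fits; return the one with the smallest
    # validation cost
    costs = [C_val(M) for M in candidates]
    best = min(range(len(candidates)), key=lambda i: costs[i])
    return candidates[best], costs
\end{verbatim}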
\section{Evaluation}
But, wait, if this is an argument for using the validation cost to {\em early
stop} the optimization (or learning), one needs to notice something weird. What
is it?
Because we used the validation cost to stop the optimization, there is a chance
that the set of parameters we found is optimal for the validation set (whose
structure consists of both the true generating function and sampling/observation
noise), but not to the general data distribution. This means that we cannot tell