13 - 3 - Optimization Objective (7 min).srt
1
00:00:00,090 --> 00:00:01,540
Most of the supervised learning algorithms
2
00:00:01,690 --> 00:00:02,890
we've seen, things like linear
3
00:00:03,130 --> 00:00:04,730
regression, logistic regression and
4
00:00:04,930 --> 00:00:05,850
so on, all of those
5
00:00:06,300 --> 00:00:08,089
algorithms have an optimization objective
6
00:00:08,670 --> 00:00:10,920
or some cost function that the algorithm was trying to minimize.
7
00:00:11,920 --> 00:00:13,180
It turns out that K-means also
8
00:00:13,770 --> 00:00:15,730
has an optimization objective or
9
00:00:15,870 --> 00:00:18,720
a cost function that it is trying to minimize.
10
00:00:19,630 --> 00:00:20,180
And in this video, I'd like to tell
11
00:00:20,230 --> 00:00:23,620
you what that optimization objective is.
12
00:00:23,730 --> 00:00:24,420
And the reason I want to do so
13
00:00:24,750 --> 00:00:26,960
is because this will be useful to us for two purposes.
14
00:00:28,020 --> 00:00:29,330
First, knowing what is the
15
00:00:29,480 --> 00:00:30,890
optimization objective of K-means
16
00:00:31,150 --> 00:00:32,390
will help us to
17
00:00:32,690 --> 00:00:33,970
debug the learning algorithm and
18
00:00:34,070 --> 00:00:35,080
just make sure that K-means is
19
00:00:35,300 --> 00:00:37,100
running correctly, and second,
20
00:00:37,610 --> 00:00:39,290
and perhaps even more importantly, in
21
00:00:39,530 --> 00:00:41,290
a later video we'll talk
22
00:00:41,490 --> 00:00:42,580
about how we can use this to
23
00:00:42,730 --> 00:00:44,000
help K-means find better clusters
24
00:00:44,070 --> 00:00:46,290
and avoid local optima, but we'll do that in a later video that follows this one.
25
00:00:46,410 --> 00:00:47,330
Just as a quick reminder, while K-means is
26
00:00:49,680 --> 00:00:52,870
running, we're going to be
27
00:00:54,450 --> 00:00:55,820
keeping track of two sets of variables.
28
00:00:56,430 --> 00:00:58,390
First is the c(i)'s, and
29
00:00:58,700 --> 00:00:59,830
that keeps track of the
30
00:01:00,190 --> 00:01:01,600
index or the number of the cluster
31
00:01:02,730 --> 00:01:05,040
to which an example x(i) is currently assigned.
32
00:01:05,230 --> 00:01:05,960
And then, the other set of variables
33
00:01:06,540 --> 00:01:07,580
we use is Mu subscript
34
00:01:08,120 --> 00:01:09,410
k, which is the location
35
00:01:10,140 --> 00:01:12,110
of cluster centroid k. And,
36
00:01:12,380 --> 00:01:13,750
again, for K-means
37
00:01:14,030 --> 00:01:17,230
we use capital K to denote the total number of clusters.
38
00:01:17,890 --> 00:01:19,310
And here lower case k,
39
00:01:20,010 --> 00:01:20,910
you know, is going to be an
40
00:01:21,040 --> 00:01:22,650
index into the cluster
41
00:01:22,970 --> 00:01:23,930
centroids, and so lower
42
00:01:24,030 --> 00:01:24,940
case k is going to be
43
00:01:25,140 --> 00:01:26,390
a number between 1 and
44
00:01:26,600 --> 00:01:29,630
capital K. Now, here's
45
00:01:29,840 --> 00:01:31,040
one more bit of notation which
46
00:01:31,270 --> 00:01:32,280
is to use Mu
47
00:01:32,630 --> 00:01:34,560
subscript c(i) to denote
48
00:01:34,970 --> 00:01:36,660
the cluster centroid of the
49
00:01:36,780 --> 00:01:38,400
cluster to which example x(i)
50
00:01:38,880 --> 00:01:40,500
has been assigned and
51
00:01:40,710 --> 00:01:42,030
to explain that notation
52
00:01:42,450 --> 00:01:43,450
a little bit more, let's
53
00:01:43,660 --> 00:01:45,600
say that x(i) has been
54
00:01:45,740 --> 00:01:47,760
assigned to cluster number five.
55
00:01:48,880 --> 00:01:49,830
What that means is that c(i),
56
00:01:50,850 --> 00:01:52,290
that is the index of x(i),
57
00:01:53,130 --> 00:01:54,300
is equal to 5.
58
00:01:54,420 --> 00:01:57,640
Right? Because you know, having c(i) equals 5,
59
00:01:57,800 --> 00:01:59,270
that's what it means for the
60
00:02:00,500 --> 00:02:01,720
example x(i) to be
61
00:02:01,910 --> 00:02:03,440
assigned to cluster number 5.
62
00:02:03,510 --> 00:02:05,700
And so Mu subscript
63
00:02:06,290 --> 00:02:07,960
c(i) is going to
64
00:02:08,100 --> 00:02:09,630
be equal to Mu subscript
65
00:02:10,080 --> 00:02:12,260
5 because c(i) is equal
66
00:02:13,700 --> 00:02:14,100
to 5.
67
00:02:15,100 --> 00:02:16,570
This Mu subscript c(i) is the
68
00:02:16,660 --> 00:02:18,420
cluster centroid of cluster number
69
00:02:18,730 --> 00:02:19,670
5, which is the cluster
70
00:02:20,120 --> 00:02:22,480
to which my example x(i) has been assigned.
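In symbols, restating what was just said: if example x(i) has been assigned to cluster k, then

    c^{(i)} = k  and  \mu_{c^{(i)}} = \mu_k,  e.g.  c^{(i)} = 5 \Rightarrow \mu_{c^{(i)}} = \mu_5.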
71
00:02:23,470 --> 00:02:24,730
With this notation, we're now
72
00:02:24,960 --> 00:02:26,040
ready to write out what
73
00:02:26,200 --> 00:02:28,150
is the optimization objective of
74
00:02:28,290 --> 00:02:30,360
the K-means clustering algorithm.
75
00:02:30,760 --> 00:02:30,800
And here it is.
76
00:02:31,330 --> 00:02:32,940
The cost function that K-means
77
00:02:33,040 --> 00:02:34,380
is minimizing is the
78
00:02:34,570 --> 00:02:35,770
function J of all of
79
00:02:35,880 --> 00:02:37,470
these parameters c1 through
80
00:02:37,890 --> 00:02:39,610
cM, Mu1 through MuK, that
81
00:02:39,790 --> 00:02:41,570
K-means is varying as the algorithm runs.
82
00:02:42,100 --> 00:02:43,930
And the optimization objective is shown
83
00:02:44,160 --> 00:02:45,520
on the right: it is the average,
84
00:02:45,610 --> 00:02:46,430
one over M, of the sum
85
00:02:46,620 --> 00:02:48,730
from i equals 1 through M of this term here
86
00:02:50,400 --> 00:02:52,670
that I've just drawn the red box around.
87
00:02:52,870 --> 00:02:54,680
The squared distance between
88
00:02:55,160 --> 00:02:57,540
each example x(i) and the
89
00:02:57,690 --> 00:02:58,740
location of the cluster
90
00:02:59,130 --> 00:03:00,210
centroid to which x(i)
91
00:03:01,320 --> 00:03:01,920
has been assigned.
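Written out, the objective just described is

    J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2

which K-means minimizes over both the assignments c^{(i)} and the centroid locations \mu_k.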
92
00:03:03,240 --> 00:03:06,070
So let me just draw this in, let me explain this.
93
00:03:06,240 --> 00:03:07,800
Here is the location of training
94
00:03:08,190 --> 00:03:09,780
example x(i), and here's the location
95
00:03:10,410 --> 00:03:11,760
of the cluster centroid to which
96
00:03:11,970 --> 00:03:13,660
example x(i) has been assigned.
97
00:03:14,560 --> 00:03:17,080
So to explain this in pictures, if here is X1, X2.
98
00:03:17,420 --> 00:03:19,540
And if a point
99
00:03:19,760 --> 00:03:21,210
here, is my example
100
00:03:22,080 --> 00:03:23,060
x(i), so if that
101
00:03:23,110 --> 00:03:24,840
is equal to my example x(i),
102
00:03:25,860 --> 00:03:27,000
and if x(i) has been assigned
103
00:03:27,240 --> 00:03:28,270
to some cluster centroid, and
104
00:03:28,340 --> 00:03:30,240
I'll denote my cluster centroid with a cross.
105
00:03:30,630 --> 00:03:32,130
So if that's the location of,
106
00:03:32,300 --> 00:03:33,830
you know, Mu 5, let's
107
00:03:34,370 --> 00:03:35,640
say, if x(i) has been
108
00:03:35,850 --> 00:03:37,960
assigned to cluster centroid 5 in my example up there.
109
00:03:38,810 --> 00:03:40,660
Then, the squared distance, that's
110
00:03:40,940 --> 00:03:41,840
the square of the distance
111
00:03:43,810 --> 00:03:46,010
between the point x(i) and this
112
00:03:46,220 --> 00:03:48,400
cluster centroid, to which x(i) has been assigned.
113
00:03:49,570 --> 00:03:50,720
And what K-means can be shown
114
00:03:51,070 --> 00:03:52,540
to be doing is that, it
115
00:03:52,680 --> 00:03:54,480
is trying to find parameters c(i)
116
00:03:55,270 --> 00:03:57,410
and Mu(k), trying to
117
00:03:57,570 --> 00:03:58,840
find c and Mu, to
118
00:03:58,960 --> 00:04:00,450
minimize this cost function J.
119
00:04:01,440 --> 00:04:03,180
This cost function is sometimes
120
00:04:03,680 --> 00:04:06,770
also called the distortion cost
121
00:04:07,060 --> 00:04:10,030
function or the distortion of
122
00:04:10,240 --> 00:04:12,130
the K-means algorithm.
123
00:04:12,790 --> 00:04:13,360
And, just to provide a little bit more
124
00:04:13,630 --> 00:04:15,750
detail, here's the K-means algorithm,
125
00:04:15,820 --> 00:04:16,450
Here's exactly the algorithm as we have it,
126
00:04:16,610 --> 00:04:17,960
from the earlier slide.
127
00:04:18,950 --> 00:04:20,200
And what this first step
128
00:04:21,030 --> 00:04:23,120
of this algorithm is, this was
129
00:04:23,830 --> 00:04:25,910
the cluster assignment step
130
00:04:27,920 --> 00:04:29,850
where we assign each
131
00:04:30,030 --> 00:04:32,910
point to the closest cluster centroid, and
132
00:04:33,010 --> 00:04:34,830
it's possible to show mathematically that
133
00:04:35,050 --> 00:04:36,210
what the cluster assignment step
134
00:04:36,450 --> 00:04:38,560
is doing is exactly minimizing
135
00:04:40,770 --> 00:04:42,950
J with respect
136
00:04:43,420 --> 00:04:45,900
to the variables c1, c2
137
00:04:46,380 --> 00:04:48,050
and so on, up
138
00:04:48,170 --> 00:04:52,030
to cM, while holding the
139
00:04:52,480 --> 00:04:54,240
cluster centroids, Mu1 up to
140
00:04:54,720 --> 00:04:57,000
MuK fixed.
141
00:04:58,580 --> 00:04:59,640
So, what the first assignment step
142
00:04:59,900 --> 00:05:00,990
does is, you know, it doesn't change
143
00:05:01,240 --> 00:05:02,850
the cluster centroids, but what it's
144
00:05:02,960 --> 00:05:05,730
doing is, exactly picking the values of c1, c2, up to cM
145
00:05:07,790 --> 00:05:10,240
that minimize the cost
146
00:05:10,500 --> 00:05:11,790
function or the distortion function,
147
00:05:12,510 --> 00:05:14,440
J. And it's possible to prove
148
00:05:14,670 --> 00:05:16,550
that mathematically, but I won't do so here.
149
00:05:17,170 --> 00:05:18,210
That has a pretty intuitive meaning
150
00:05:18,610 --> 00:05:19,630
of just, you know, well let's assign
151
00:05:20,090 --> 00:05:21,040
these points to the cluster centroid
152
00:05:21,530 --> 00:05:22,820
that is closest to it, because
153
00:05:23,120 --> 00:05:24,160
that's what minimizes the squared
154
00:05:24,660 --> 00:05:26,860
distance between the points and the corresponding cluster centroid.
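To make the cluster assignment step concrete, here is a minimal NumPy sketch; the names X, centroids, and assign_clusters are illustrative choices of mine, not from the lecture, and cluster indices are 0-based rather than 1 through K:

import numpy as np

def assign_clusters(X, centroids):
    # X: (m, n) array of examples; centroids: (K, n) array holding
    # Mu_1 through Mu_K. Assigning each example to its closest centroid
    # minimizes J with respect to c(1)..c(m), centroids held fixed.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)  # c, shape (m,): nearest-centroid index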
155
00:05:27,840 --> 00:05:29,090
And then the other part of
156
00:05:29,790 --> 00:05:32,880
K-means, this second step over here.
157
00:05:33,960 --> 00:05:35,480
This second step was the move
158
00:05:35,690 --> 00:05:38,770
centroid step and,
159
00:05:39,000 --> 00:05:40,020
once again, I won't prove it,
160
00:05:40,510 --> 00:05:41,250
but it can be shown
161
00:05:41,520 --> 00:05:42,590
mathematically, that what the
162
00:05:43,140 --> 00:05:44,910
move centroid step does, is
163
00:05:45,150 --> 00:05:46,740
it chooses the values
164
00:05:47,260 --> 00:05:49,370
of Mu that minimize J.
165
00:05:50,150 --> 00:05:51,270
So it minimizes the cost function
166
00:05:51,650 --> 00:05:53,000
J with respect to,
167
00:05:53,380 --> 00:05:54,710
where wrt is my
168
00:05:54,920 --> 00:05:56,930
abbreviation for with respect to.
169
00:05:57,030 --> 00:05:58,380
But it minimizes J with respect
170
00:05:58,790 --> 00:06:01,930
to the locations of the cluster centroids, Mu1 through MuK.
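A matching sketch of the move centroid step, under the same assumed names (and assuming no cluster ends up empty, which a real implementation would need to handle):

def move_centroids(X, c, K):
    # Setting each Mu_k to the mean of the examples assigned to
    # cluster k minimizes J with respect to Mu_1..Mu_K, with the
    # assignments c held fixed.
    return np.array([X[c == k].mean(axis=0) for k in range(K)])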
171
00:06:02,040 --> 00:06:05,690
So, what K-means really
172
00:06:05,790 --> 00:06:06,910
is doing is it's taking the
173
00:06:07,010 --> 00:06:08,380
two sets of variables and
174
00:06:09,070 --> 00:06:11,210
partitioning them into two halves right here.
175
00:06:11,550 --> 00:06:14,490
First the c set of variables and then the Mu set of variables.
176
00:06:15,450 --> 00:06:15,990
And what it does is it first
177
00:06:16,560 --> 00:06:17,750
minimizes J with respect
178
00:06:18,050 --> 00:06:19,350
to the variables c, and then minimizes
179
00:06:19,700 --> 00:06:20,610
J with respect to the variables
180
00:06:21,120 --> 00:06:22,590
Mu, and then it keeps on iterating.
181
00:06:25,180 --> 00:06:26,680
And so that's all that K-means does.
182
00:06:27,700 --> 00:06:28,570
And, now that we understand
183
00:06:29,150 --> 00:06:30,870
K-means as trying to
184
00:06:31,030 --> 00:06:32,190
minimize this cost function J, we
185
00:06:32,430 --> 00:06:33,640
can also use this to
186
00:06:33,800 --> 00:06:34,890
try to debug our learning
187
00:06:35,090 --> 00:06:36,350
algorithm and just kind
188
00:06:36,520 --> 00:06:37,980
of make sure that our implementation
189
00:06:38,900 --> 00:06:39,950
of K-means is running correctly.
190
00:06:41,220 --> 00:06:42,560
So, we now understand the
191
00:06:43,070 --> 00:06:44,260
K-means algorithm as trying to
192
00:06:44,610 --> 00:06:45,960
optimize this cost function J,
193
00:06:46,640 --> 00:06:48,790
which is also called the distortion function.
194
00:06:50,650 --> 00:06:51,600
We can use that to debug K-means
195
00:06:52,090 --> 00:06:53,060
and help us show that K-means
196
00:06:53,130 --> 00:06:54,050
is converging, and that it's
197
00:06:54,510 --> 00:06:56,160
running properly, and in the
198
00:06:56,240 --> 00:06:57,460
next video, we'll also see
199
00:06:57,690 --> 00:06:59,040
how we can use this to
200
00:06:59,120 --> 00:07:00,650
help K-means find better clusters
201
00:07:01,300 --> 00:07:03,240
and help K-means to avoid local optima.
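Along the lines of the debugging idea above, here is a sketch that monitors J while K-means runs, built on the two functions sketched earlier (max_iters and the tolerances are arbitrary choices of mine): since each of the two steps can only lower J, the distortion must never increase from one iteration to the next.

def distortion(X, c, centroids):
    # J: average squared distance from each example to the centroid
    # of the cluster to which it has been assigned.
    return ((X - centroids[c]) ** 2).sum(axis=1).mean()

def kmeans_with_check(X, centroids, max_iters=100):
    prev_J = float("inf")
    K = centroids.shape[0]
    for _ in range(max_iters):
        c = assign_clusters(X, centroids)    # minimizes J w.r.t. c
        centroids = move_centroids(X, c, K)  # minimizes J w.r.t. Mu
        J = distortion(X, c, centroids)
        # If J ever goes up, the implementation has a bug.
        assert J <= prev_J + 1e-12, "J increased: K-means bug"
        if prev_J - J < 1e-9:  # essentially converged
            break
        prev_J = J
    return c, centroids, J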