// my_quiz.quiz
{
// example quiz text
// <- (this is a comment and will be ignored)
// this is the url for your quiz
"url": "http://jgbdev.github.io/ml_quiz/",
// these are your UoB candidate numbers as a comma-separated list
"candidate_number": [0],
// this is the title of the quiz
"title": "COMS30301 Quiz",
"1": {
"difficulty": "1",
"reference": "1.1",
"problem_type": "definitions",
"question" : "Depending on the target goal for machine learning, different tasks are used.<br>\
Use the cards below to match the target type to the machine learning task.",
// these are the matching pairs: "correctness" holds the card label
// that matches the "answer" value
"answer_type": "matrix_sort_answer",
"answers": [
{ "correctness": "Binary and multi-class classification",
"answer": "categorical",
"explanation": "A categorical target defines a class label for classification"
},
{ "correctness": "Regression" ,
"answer": "numerical",
"explanation": "Regression outputs a numerical value"
},
{ "correctness": "Clustering",
"answer": "hidden",
"explanation": "Clustering is used to find hidden variables; it is often \
used in unsupervised learning, where the class/value is unknown to the model"
}
],
"hint": "Use the textbook to look up the definitions",
"workings": "From the definitions in the textbook: <br>\
Regression -> numerical output,<br>\
Binary and multi-class classification -> categorical answer<br>\
Clustering -> hidden variables ",
"source": "Textbook 1.1",
"comments": "Can be answered by looking up a passage in textbook, 1 difficulty"
},
"2": {
"difficulty": "3",
"reference": "5.1",
"problem_type": "training",
"question" : "Terry wants to use his past experiences with car purchases \
to help him avoid picking a new car he would later regret <br>\
He collects the data from his previous car purchases, which is included \
below. <br>\
<table>\
<tr>\
<th>Price</th>\
<th>Persons</th>\
<th>Safety</th>\
<th>Acceptable</th>\
</tr>\
<tr>\
<td>vhigh<br></td>\
<td>more</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>2</td>\
<td>med</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>2</td>\
<td>high</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>more</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>more</td>\
<td>med</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>2</td>\
<td>med</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>high</td>\
<td>2</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>high</td>\
<td>4</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>high</td>\
<td>4</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>med</td>\
<td>2</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>med</td>\
<td>2</td>\
<td>med</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>med</td>\
<td>2</td>\
<td>low</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>med</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>more</td>\
<td>high</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>2</td>\
<td>med</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>med</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>high</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>med</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>4</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>low</td>\
<td>more</td>\
<td>med</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>high</td>\
<td>more</td>\
<td>high</td>\
<td>acc</td>\
</tr>\
<tr>\
<td>vhigh</td>\
<td>2</td>\
<td>low</td>\
<td>unacc</td>\
</tr>\
</table>\
He asks you to build a decision tree model using the training data (above), \
using Entropy to find the best split and taking unacc as \
the positive class.<br> \
<br>\
When building the tree, stop splitting a node if the number of instances in the node is \
five or less or the majority class is above 90%. <br>\
Label each node with the majority class. <br>\
After building the tree, evaluate the model by classifying the training data. <br> \
<br>\
Record your results in a contingency table and calculate the precision. <br>\
Select the answer which is equal to your value. <br> ",
// these are answers, a correct answer
// is indicated by a "+", incorrect answer has a "-"
"answer_type": "single",
"answers": [
{ "correctness": "-",
"answer": "0.72"
},
{ "correctness": "+",
"answer": "0.875",
"explanation": "Taking the final contingency table from workings<br>\
Contingency table<br>\
[14, 2]<br>\
[2, 11 ]<br>\
Precision = 14/(14+2) = 0.875"
},
{ "correctness": "-",
"answer": "0.912"
},
{ "correctness": "-",
"answer": "0.845"
}
],
"hint": "Entropy calculation $-p\\log_{2}p-(1-p)\\log_{2}(1-p)$ ",
"workings": "\
\
Calculated the weighted entropy of the children at each split using <br><br>\
$$\\sum_{i=1}\\frac{n_{i}}{n}(-p\\log_{2}p-(1-p)\\log_{2}(1-p))$$<br><br>\
To maximise the information gain, I choose the split with the lowest weighted entropy<br><br>\
Split 1:<br>\
Detailed calculation for Safety split:<br><br>\
$-0.9\\log_{2}(0.9)-0.1\\log_{2}(0.1) = 0.469$<br>\
$-\\frac{4}{9}\\log_{2}(\\frac{4}{9})-\\frac{5}{9}\\log_{2}(\\frac{5}{9}) = 0.991$<br>\
$-0.3\\log_{2}(0.3)-0.7\\log_{2}(0.7) = 0.881$<br>\
Weighted Entropy = $\\frac{10}{29}0.469+\\frac{9}{29}0.991+\\frac{10}{29}0.881=0.773$<br>\
Safety [9+,1-][4+,5-][3+,7-] : Weighted entropy = 0.773 (Chosen)<br>\
Price [7+,2-][1+,3-][2+,1-][6+,7-] : Weighted entropy = 0.89<br>\
Persons [6+,6-][7+,2-][3+,5-] : Weighted entropy = 0.914<br><br>\
Split 2: Safety Medium<br>\
Price [2+,2-][1+,0-][1+,3-] Weighted entropy = 0.805<br>\
Persons [3+,1-][1+,3-][0+,1-] Weighted entropy = 0.72 (Chosen)<br>\
<br><br>\
Split 3: Safety High<br>\
Price [2+,0-][0+,3-][1+,4-] Weighted entropy = 0.361 (Chosen)<br>\
Persons [2+,0-][0+,3-][2+,3-] Weighted entropy = 0.490<br>\
<br><br>\
<b>Final Tree:</b><br>\
Showing the route from root to leaf.<br>\
The feature in parentheses is the splitting feature, followed by the value on the current \
path. <br>\
(Safety) low [9+,1-]<br>\
(Safety) med - (Persons) 2 [3+,1-]<br> \
(Safety) med - (Persons) 4 [1+,3-] <br>\
(Safety) med - (Persons) more [0+,1-]<br>\
\
(Safety) high - (Price) vhigh [2+,0-]<br>\
(Safety) high - (Price) high [0+,3-]<br>\
(Safety) high - (Price) low [1+,4-]<br>\
<br>\
We can now easily build a contingency table by classifying the training data<br>\
$$\\begin{vmatrix} \
&Pred + &Pred - & \\\\ \
Actual +& 14 &2 & 16 \\\\ \
Actual -& 2 &11 & 13 \\\\ \
& 16& 13 & 29\
\\end{vmatrix}$$\
<br><br>\
Precision = $\\frac{TP}{TP+FP}$<br>\
Precision = 14/(14+2) = 0.875",
"source": "Textbook 1.2 and 5.1",
"comments": "\
Requires the candidate to have a strong understanding of how a Decision Tree is formed. \
They must know how to use the Entropy equation, what it represents and why \
it's used to split the tree.<br>\
After building the tree the candidate has the further task of calculating the precision, \
which checks they can evaluate how well the Decision Tree has classified the training data.\
"
},
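// The weighted-entropy values in the workings above can be checked with a short
// script. Below is a minimal Python sketch (illustrative only; it would live
// outside this quiz file, as it is not part of the quiz format):

```python
import math

def entropy(p):
    """Binary entropy -p*log2(p) - (1-p)*log2(1-p); defined as 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_entropy(splits):
    """splits: list of (positives, negatives) counts, one pair per child node."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy(p / (p + n)) for p, n in splits)

# Safety split from the workings: [9+,1-][4+,5-][3+,7-]
safety = weighted_entropy([(9, 1), (4, 5), (3, 7)])
print(round(safety, 3))  # 0.773
```

// The split with the lowest weighted entropy maximises information gain,
// which is why Safety is chosen at the root.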
"3": {
"difficulty": "3",
"reference": "9.2",
"problem_type": "training",
"question" : "\
Doctor X is investigating the symptoms of this new disease which has hit his \
patients. He wants to avoid expensive blood tests and instead use the symptoms reported by \
his patients as a diagnosis tool. <br>\
He asks ten patients to record how many times during the week they experience the following \
symptoms: <br>\
Stomach Ache, Back Pain, Cough, Sneezing.<br>\
He then performs a blood test on the ten patients to test if they have the disease.<br>\
<br>\
He records their results and asks you to build a Naive Bayes Model \
(taking into account frequency of symptoms). <br>\
<br>\
The results are in the table below.\
<table><tr><th>Patient</th><th>Stomach Ache</th><th>Back Pain</th><th>Cough</th><th>Sneezing</th><th>Has Disease</th></tr><tr><td>p1</td><td>2</td><td>3</td><td>1</td><td>0</td><td>1</td></tr><tr><td>p2</td><td>2</td><td>2</td><td>2</td><td>1</td><td>1</td></tr><tr><td>p3</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>p4</td><td>0</td><td>3</td><td>1</td><td>1</td><td>1</td></tr><tr><td>p5</td><td>2</td><td>1</td><td>3</td><td>0</td><td>1</td></tr><tr><td>p6</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr><tr><td>p7</td><td>0</td><td>2</td><td>0</td><td>1</td><td>0</td></tr><tr><td>p8</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>p9</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr><tr><td>p10</td><td>1</td><td>2</td><td>0</td><td>0</td><td>0</td></tr></table>\
He also tells you that the accepted probability of someone in the population having the \
disease is 0.2. <br>\
He then asks you to diagnose four test patients using the symptom table below. <br>\
<table><tr><th>Patient</th><th>Stomach Ache</th><th>Back Pain</th><th>Cough</th><th>Sneezing</th></tr><tr><td>t1</td><td>1</td><td>2</td><td>1</td><td>2</td></tr><tr><td>t2</td><td>1</td><td>2</td><td>2</td><td>1</td></tr><tr><td>t3</td><td>3</td><td>0</td><td>2</td><td>0</td></tr><tr><td>t4</td><td>2</td><td>1</td><td>1</td><td>2</td></tr></table>\
Tick below which patient your model diagnosed with the disease.<br>\
<br>\
In an alternative scenario the Doctor discards the frequency of the symptoms. <br>\
Retrain the Bayesian Model, but this time disregard the frequency of the symptoms, \
by only taking into account if the symptom occurred.<br>\
<br>\
Retest the test patients and tick the 'Changed' box if the new model classifies results \
differently to the previous one.<br>\
",
// these are answers, a correct answer
// is indicated by a "+", incorrect answer has a "-"
"answer_type": "multiple",
"answers": [
{ "correctness": "-",
"answer": "t1",
"explanation": "Ratio < 1"
},
{ "correctness": "+",
"answer": "t2",
"explanation": "Ratio > 1"
},
{ "correctness": "+",
"answer": "t3",
"explanation": "Ratio > 1"
},
{ "correctness": "-",
"answer": "t4",
"explanation": "Ratio < 1"
},
{ "correctness": "+",
"answer": "Changed",
"explanation": "Classification of second model different to first"
}
],
"hint": "Make sure you use the Laplace correction to smooth out the extreme results <br>\
Focus on calculating the ratios between the probabilities of being in either class; <br>\
the numerical probabilities have little relevance on their own.",
"workings": "<br>\
Count Model:<br>\
With symptoms in order, summing each class generated the following counts:<br>\
p: 7 10 7 2 <br>\
n: 3 6 0 2<br>\
Using Laplace correction, I calculated the probabilities for having a symptom in <br>\
each class<br>\
<br>\
p: ($\\frac{4}{15}$, $\\frac{11}{30}$, $\\frac{4}{15}$, $\\frac{1}{10}$ )<br>\
n: ($\\frac{4}{15}$, $\\frac{7}{15}$, $\\frac{1}{15}$, $\\frac{1}{5}$ )<br>\
<br>\
Dividing the probability of each feature in the positive class by the corresponding <br>\
probability in the negative class gives the per-feature ratios:<br>\
(1, $\\frac{11}{14}$, 4, 0.5)<br>\
<br>\
The likelihood is then obtained by raising each ratio to the frequency with which <br>\
the feature occurs in the testing point. <br>\
<br>\
We can now calculate the likelihood ratio: <br>\
$$1^{f1} \\frac{11}{14} ^{f2}4^{f3} 0.5 ^{f4}$$ <br>\
<br>\
Where fn is the count of feature n in the data point being classified.<br>\
<br>\
We calculate the prior ratio: $\\frac{P(+)}{P(-)}$.<br>\
Prior ratio = $\\frac{0.2}{1-0.2} = 0.25$.<br>\
<br>\
To take this into account in our likelihood ratio we multiply it by our prior ratio.<br>\
$$\\frac{11}{14} ^{f2}4^{f3} 0.5 ^{f4} 0.25$$ <br>\
<br>\
When the likelihood ratio is above 1 the $P(+|x) > P(-|x)$, therefore we classify as positive.<br>\
If the likelihood ratio is less than 1 we classify as negative. <br>\
<br>\
Classifying our 4 test patients : <br>\
t1: $\\frac{11}{14} ^{2}4^{1} 0.5 ^{2} 0.25 = \\frac{121}{784}$ (negative)<br>\
t2: $\\frac{121}{98}$ (positive)<br>\
t3: $4$ (positive)<br>\
t4: $\\frac{11}{56}$ (negative)<br>\
<br>\
t2,t3 classed as positive.<br>\
<br>\
Now train again, this time not taking into account frequency of symptoms.<br>\
Modify the symptoms so that any value above 0 is replaced with 1.<br>\
<br>\
Sum the symptoms in each class<br>\
p: 4 5 4 2<br>\
n: 2 4 0 3<br>\
<br>\
Using Laplace correction I generate the probabilities for each feature in each class.<br>\
p: ($\\frac{5}{7}$,$\\frac{6}{7}$,$\\frac{5}{7}$,$\\frac{3}{7}$) <br>\
n: ($\\frac{3}{7}$,$\\frac{5}{7}$,$\\frac{1}{7}$,$\\frac{3}{7}$)<br>\
<br>\
I can now use the likelihood ratio * prior ratio (0.25) to classify the test data.<br>\
Like above, a value above 1 will be classed as positive. <br>\
A value below 1 will be classed as negative<br>\
<br>\
For classifying a test point, if a given feature is present we take the probability that class \
+ contains the feature divided by the probability that class - contains the feature. \
If the feature is absent we instead take the probability that each class doesn't contain the \
feature, using $ 1 - p $ for each class. <br>\
<br>\
We multiply all the values to get our likelihood ratio.\
<br>\
t1 = (1,1,1,1)<br>\
ratio: $\\frac{5}{3}\\frac{6}{5}\\cdot5\\cdot 1 \\cdot 0.25 = 2.5$ (Positive)<br>\
<br>\
t2 = (1,1,1,1)<br>\
ratio: $\\frac{5}{3}\\frac{6}{5}\\cdot5\\cdot 1 \\cdot 0.25 = 2.5$ (Positive)<br>\
<br>\
t3 = (1,0,1,0)<br>\
ratio: $\\frac{5}{3}\\frac{1}{2}\\cdot5\\cdot 1 \\cdot 0.25 = \\frac{25}{24}$ (Positive)<br>\
<br>\
t4 = (1,1,1,1)<br>\
ratio: $\\frac{5}{3}\\cdot\\frac{6}{5}\\cdot5\\cdot 1 \\cdot 0.25 = 2.5$ (Positive)<br>\
<br>\
t1,t2,t3,t4 classified as positive.<br>\
<br>\
",
"source": "Textbook 9.2",
"comments": "Requires the candidate to have knowledge of how a Bayesian model is trained for count and boolean\
values. The candidate also needs to be able to correctly use the trained model for classification.\
Requires a significant amount of calculation, justifying difficulty 3\
"
},
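// The count-model classification above reduces to multiplying per-feature
// likelihood ratios raised to the symptom counts, times the prior ratio. A
// hypothetical Python sketch of that scoring (illustrative only; not part of
// the quiz format), using the ratios (1, 11/14, 4, 0.5) from the workings:

```python
# Per-feature likelihood ratios P(symptom|+)/P(symptom|-) from the workings,
# and prior ratio P(+)/P(-) = 0.2/0.8.
ratios = [1.0, 11 / 14, 4.0, 0.5]
prior_ratio = 0.2 / 0.8

def likelihood_ratio(counts):
    """Posterior-odds score for a test patient's symptom counts;
    a value above 1 means classify as positive (has the disease)."""
    score = prior_ratio
    for r, f in zip(ratios, counts):
        score *= r ** f  # each ratio raised to the symptom's count
    return score

print(likelihood_ratio([3, 0, 2, 0]))  # t3 -> 4.0 (positive)
```

// Test patient t1 = (1, 2, 1, 2) scores below 1, matching the negative
// classification in the workings.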
"4": {
"difficulty": "3",
"reference": "6.3",
"problem_type": "calculation",
"question" : "Dave's Electronic store wants to group items which sell well together. <br>\
So he can answer the question, if the customer buys product A then they will <br>\
likely buy product B.<br><br>\
To do this he records the transactions for one day and presents them to you in the table below.<br><br>\
<table><tr><th>ID</th><th>Transaction Contents</th></tr><tr><td>1</td><td>Camera, SD Card</td></tr><tr><td>2</td><td>Camera, SD Card</td></tr><tr><td>3</td><td>USB Cable</td></tr><tr><td>4</td><td>Phone, USB Cable</td></tr><tr><td>5</td><td>Phone, USB Cable</td></tr><tr><td>6</td><td>Phone, USB Cable, Camera</td></tr><tr><td>7</td><td>Phone, USB Cable, Phone</td></tr><tr><td>8</td><td>USB Cable, Phone</td></tr><tr><td>9</td><td>Camera, SD Card</td></tr><tr><td>10</td><td>Phone, USB Cable</td></tr><tr><td>11</td><td>USB Cable</td></tr><tr><td>12</td><td>USB Cable</td></tr><tr><td>13</td><td>SD Card</td></tr><tr><td>14</td><td>SD Card</td></tr></table>\
Using the transactions, form two rules <br><br>\
Rule 1: If X then Y <br> \
Rule 2: If J then K <br><br> \
by matching the items to the corresponding letters below. <br>\
Ensure that Rule 1 has higher confidence than Rule 2.<br><br>",
// these are the matching pairs: "correctness" holds the letter
// that matches the "answer" value
"answer_type": "matrix_sort_answer",
"answers": [
{ "correctness": "X ",
"answer": "Phone",
"explanation": "Top rule if X is Phone "
},
{ "correctness": "Y " ,
"answer": "USB Cable",
"explanation": "Top rule if Phone then Y is USB Cable"
},
{ "correctness": "J ",
"answer": "Camera",
"explanation": "Second highest rule: J is Camera"
},
{ "correctness": "K ",
"answer": "SD Card",
"explanation": "Second highest rule: if Camera then K is SD Card"
}
],
"hint": "Use the FrequentItems algorithm to generate the set of maximal frequent item sets <br>\
Draw a set lattice to visualise the transactions, making it easier to run the AssociationRules \
algorithm. <br> Do some post-processing to check for superfluous rules. \
",
"workings": "\
<br>\
We use the AssociationRules algorithm to generate a set of association rules<br>\
that exceed your given threshold for support and confidence.<br>\
<br>\
Support is the number of transactions that contain the target item set.<br>\
Confidence is the ratio of transactions containing the rule's antecedent that also satisfy the rule.<br>\
Support is used to calculate confidence. <br>\
E.g. Confidence of rule (if X then Y) : $Support(X \\cup Y)/Support(X)$<br>\
<br>\
<br>\
We set a reasonable threshold of 0.6 confidence and 3 support.<br>\
<br>\
First we need to generate a maximal item set to generate a set of items to <br>\
find rules from.<br>\
<br>\
We use the FrequentItems algorithm with a support threshold of 3 to find the sets. <br>\
<br>\
Starting with the empty set. <br>\
Extend to SD Card.<br>\
{SD Card}, support 6 (Add to queue, not max).<br>\
Extend {USB Cable}<br>\
{USB Cable}, support 8 (Add to queue, not max).<br>\
<br>\
Next: SD Card<br>\
Extend {SD Card, Camera}<br>\
{SD Card, Camera} Support 4 (Add to queue, not max)<br>\
Extend {SD Card, Phone }<br>\
{SD Card, Phone }, support 1<br>\
<br>\
Next: USB Cable<br>\
Extend {USB Cable, Phone}<br>\
{USB Cable, Phone}, support 5 (Add to queue, not max)<br>\
Extend {USB Cable, Camera} <br>\
{USB Cable, Camera}, support 1<br>\
<br>\
Next: {SD Card, Camera}<br>\
Extend {SD Card, Camera, Phone}<br>\
{SD Card, Camera, Phone}, support 1<br>\
max = true, add to max item set<br>\
<br>\
Next: {USB Cable, Phone}<br>\
Extend {USB Cable, Phone, Camera}<br>\
{USB Cable, Phone, Camera}, support 1<br>\
max = true, add to max item set<br>\
<br>\
Output maximal item sets: {{USB Cable, Phone}, {SD Card, Camera}} <br>\
<br>\
Next calculate the support and confidence for the rules in each of <br>\
the maximal item sets<br>\
<br>\
<br>\
Below is the list of all possible rules <br>\
<br>\
if phone then usb cable: support 5 confidence 0.83 lift 1.46<br>\
if camera then sd card: support 4 confidence 0.80 lift 1.87<br>\
if sd card then camera : support 4 confidence 0.67 lift 1.87<br>\
if usb cable then phone: support 5 confidence 0.63 lift 1.46<br>\
<br>\
Calculation for lift: Lift(if X then Y) = $\\frac{n\\cdot Supp(X\\cup Y)}{Supp(X)Supp(Y)}$<br>\
<br>\
We also calculate the lift to filter out superfluous rules<br>\
Remove any values 1 or less<br>\
If lift (if X then Y) <= 1, no better than doing if True then Y <br>\
All four rules have a lift > 1, so no need to filter<br>\
Take the top 2 rules<br>\
if phone then usb cable <br>\
if camera then sd card",
"source": "Textbook 6.3",
"comments": "\
Candidate has to have the understanding of what the support, confidence \
formulas are and how they can be applied to the transaction list. <br>\
The candidate also has to have the knowledge of the pre processing step called \
lift and how to use it to filter out superfluous results. <br>\
The question involves a lot of calculation to calculate the support and confidence \
for each rule.\
"
},
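// The support, confidence and lift formulas used in the workings can be
// expressed in a few lines of Python. A generic sketch (illustrative only;
// not part of the quiz format), demonstrated on a tiny hypothetical
// transaction list rather than Dave's table:

```python
def support(items, transactions):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(x, y, transactions):
    """Confidence of the rule 'if X then Y' = Supp(X u Y) / Supp(X)."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Lift = n * Supp(X u Y) / (Supp(X) * Supp(Y)); <= 1 means superfluous."""
    n = len(transactions)
    return n * support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions))

# Hypothetical toy transactions to illustrate the formulas
txns = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}]
print(confidence({"A"}, {"B"}, txns))  # -> 2/3
print(lift({"A"}, {"B"}, txns))        # -> 8/9
```

// A lift at or below 1 means the rule is no better than "if True then Y",
// which is exactly the filtering step described above.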
"5": {
"difficulty": "3",
"reference": "6.2",
"problem_type": "evaluation",
"answer_type": "blank_answer",
"question": [ \
"<b>Feature list</b> <br>\
<table><tr><th>Positive/Negative</th><th>Colour</th><th>Cruise Control</th><th>Doors</th></tr><tr><td>p1</td><td>Red</td><td>yes</td><td>3</td></tr><tr><td>p2</td><td>Red</td><td>yes</td><td>3</td></tr><tr><td>p3</td><td>Red</td><td>No</td><td>5</td></tr><tr><td>p4</td><td>Blue</td><td>No</td><td>5</td></tr><tr><td>p5</td><td>Red</td><td>Yes</td><td>5</td></tr><tr><td></td><td></td><td></td><td></td></tr><tr><td>n1</td><td>Blue</td><td>No</td><td>5</td></tr><tr><td>n2</td><td>Green</td><td>No</td><td>5</td></tr><tr><td>n3</td><td>Green</td><td>Yes</td><td>5</td></tr><tr><td>n4</td><td>Blue</td><td>Yes</td><td>3</td></tr><tr><td>n5</td><td>Red</td><td>Yes</td><td>5</td></tr></table>\
<br><br>\
In this question we explore the differences \
between ordered and unordered lists by comparing their accuracy on \
a ROC curve <br> \
<br> \
To answer the question, perform the following steps <br>\
1. From the feature list above, pick two single rules, A and B, that capture the most \
of each class <br> \
2. Generate two ordered rule lists, swapping A and B as the \
first rule <br>\
3. Calculate the AUC for each rule list, making note of the highest.<br>\
4. Calculate a rule tree by calculating the overlap between the rules. <br>\
5. Calculate the AUC for the rule tree <br>\
6. Fill in the fields below with the AUC of the best rule list and of the rule tree<br>\
AUC for ordered rule list is: ", 1,". <br>",
"AUC for rule tree: ", 2, ".<br>"
],
"answers": [
{ "correctness": 1,
"answer": "0.82",
"explanation": "See workings"
},
{ "correctness": 2,
"answer": "0.84",
"explanation": "See workings"
}
],
"hint": "The AUC can be calculated as the area under a ROC curve<br> \
When calculating the rule tree, look back at the rule list to see which \
rules overlap.<br>\
When building the rule list, exclude the items covered in the previous rule\
",
"workings": "\
For the positive class, colour = red covers the most at 4.<br>\
For the negative class Doors = 5 covers the most at 4.<br>\
Therefore create two rules A, B.<br>\
A: Colour = red [p1-p3 p5 | n5]<br>\
B: Doors = 5 [ p3 - p5 | n1 - n3 n5] <br>\
This forms the two ordered lists <br>\
<br>\
1:<br>\
if A then + [4+ 1-]<br>\
else if B then - [1+ 3-]<br>\
else - [0+ 1-]<br>\
<br>\
error = $4\\cdot1\\cdot0.5 + 3\\cdot1\\cdot0.5 + 1 = 4.5$<br>\
<br>\
2:<br>\
if B then - [3+ 4-]<br>\
else if A then + [2+ 0-]<br>\
else - [0+ 1-]<br>\
error = $3\\cdot4\\cdot0.5 = 6$<br>\
Rule order AB has lowest error<br><br>\
AUC can be calculated as the area under the ROC curve,<br>\
AUC = (total_area - error)/total_area = $(25-4.5)/25 = 0.82$<br>\
<br>\
Next generate rule tree<br>\
<br>\
To do so we have to calculate the intersection of A and B.<br>\
As we have access to the training data we can calculate this exactly; <br>\
otherwise we would have to estimate it. <br>\
<br>\
A and B [2+ 1-]<br>\
We can also calculate further intersections<br>\
A and !B [2+ 0-]<br>\
B and !A [1+ 3-]<br>\
<br>\
We can experiment with two trees, with either A or B at root.<br>\
A root:<br>\
Split <br>\
A [4+ 1-]<br>\
!A [1+ 4-]<br>\
We can split A branch further<br>\
A and B [2+ 1-]<br>\
A and !B [2+ 0-]<br>\
<br>\
This generates the following ranking [2+ 0-][2+ 1-][1+ 4-]<br>\
Gives an error of 4, AUC = 0.84<br>\
<br>\
B root:<br>\
Split<br>\
B [3+ 4-]<br>\
!B [2+ 1-]<br>\
<br>\
B branch can be further split<br>\
B and A [2+ 1-]<br>\
B and !A [1+ 3-]<br>\
<br>\
This generates the ranking [2+ 1-][2+ 1-][1+ 3-]<br>\
Error = $2\\cdot1\\cdot0.5 + 2\\cdot1\\cdot0.5 + 1\\cdot3\\cdot0.5 + 3 + 1 = 7.5$<br>\
<br>\
A at root is picked as it has the lowest error and therefore highest AUC<br>\
",
"source": "Textbook 6.3 and 6.1",
"comments": "\
The candidate has to be able to pick the best rules out of the rule list. \
Using that the candidate then has to generate ordered rule lists and evaluate how well they perform by ranking the leaves.\
The candidate has to do this again but with an unordered rule set \
which involves calculating the overlapping regions in order to generate the \
rule tree"
},
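// The ranking error and AUC calculations in the workings follow a fixed
// recipe: order the leaf segments most-positive-first, count misranked
// positive/negative pairs (ties count half), and normalise by the total area.
// A Python sketch of that recipe (illustrative only; not part of the quiz format):

```python
def ranking_error(segments):
    """segments: list of (positives, negatives), ordered most-positive-first.
    Counts pos/neg pairs ranked wrongly; ties within a segment count 0.5."""
    error = 0.0
    for i, (p_i, n_i) in enumerate(segments):
        error += 0.5 * p_i * n_i           # ties within the segment
        for p_j, _ in segments[i + 1:]:
            error += n_i * p_j             # negatives ranked above later positives
    return error

def auc(segments):
    """AUC = (total_area - error) / total_area, total_area = P * N."""
    pos = sum(p for p, _ in segments)
    neg = sum(n for _, n in segments)
    return (pos * neg - ranking_error(segments)) / (pos * neg)

# Rule tree with A at root from the workings: [2+ 0-][2+ 1-][1+ 4-]
print(auc([(2, 0), (2, 1), (1, 4)]))  # 0.84
```

// The ordered list AB, [4+ 1-][1+ 3-][0+ 1-], gives the error of 4.5 and
// AUC of 0.82 quoted in the answer.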
"6": {
"difficulty": "3",
"reference": "8.3",
"problem_type": "calculation",
"question" : "The k-means algorithm was used to generate the following clustering, <br>\
$$\
\\begin{bmatrix} \
2&3 \\\\ \
7 & \\frac{22}{3} \
\\end{bmatrix}\
$$\
with one of the datasets below. <br>\
Apply the k-means algorithm to each of the datasets to find which one was used, <br>\
starting from the initial centroids<br>\
[0 1] and [4 4]. <br>",
// these are answers, a correct answer
// is indicated by a "+", incorrect answer has a "-"
"answer_type": "single",
"answers": [
{ "correctness": "+",
"answer": "w:<br>\
$\
\\begin{bmatrix} \
1&2 \\\\ \
3 & 4 \\\\ \
5 & 6 \\\\ \
7 & 8 \\\\ \
9 & 8 \
\\end{bmatrix}\
$",
"explanation": "Final iteration for w in workings matches the clustering matrix above "
},
{ "correctness": "-",
"answer": "x:<br>\
$\
\\begin{bmatrix} \
1&2 \\\\ \
3 & 4 \\\\ \
5 & 6 \\\\ \
7 & 8 \\\\ \
9 & 10 \
\\end{bmatrix}\
$"
},
{ "correctness": "-",
"answer": "y:<br>\
$\
\\begin{bmatrix} \
1&2 \\\\ \
5 & 4 \\\\ \
5 & 6 \\\\ \
7 & 8 \\\\ \
9 & 10 \
\\end{bmatrix}\
$\
"
},
{ "correctness": "-",
"answer": "z:<br>\
$\
\\begin{bmatrix} \
1&3 \\\\ \
5 & 4 \\\\ \
5 & 6 \\\\ \
4 & 7 \\\\ \
9 & 10 \
\\end{bmatrix}\
$"
}
],
"hint": "Keep running the k-means algorithm until there is no change in the centroids.<br>\
Make sure you initialise your initial clusters with the centroids [0 1][4 4]",
"workings": "\
<br>\
Starting at centroids <br>\
$$\
\\begin{bmatrix} \
0&1 \\\\ \
4 & 4 \
\\end{bmatrix}\
$$<br>\
Apply k-means (k=2) until there is no change in the centroids. <br>\
At each iteration, group the values to the closest centroid, <br>\
then update the centroid with the mean of those values.\
<br>\
<b>x:</b> <br>\
Iteration 1:<br>\
x1 -> cluster 1<br>\
x2-x5 -> cluster 2<br>\
Recompute mean <br>\
$\
\\begin{bmatrix} \
1&2 \\\\ \
6&7 \
\\end{bmatrix}\
$<br>\
Iteration 2:<br>\
$\
\\begin{bmatrix} \
2&3 \\\\ \
7&8 \
\\end{bmatrix}\
$<br>\
Iteration 3:<br>\
$\
\\begin{bmatrix} \
2&3 \\\\ \
7&8 \
\\end{bmatrix}\
$<br>\
Stop, no change.<br>\
<br>\
<b>y:</b><br>\
Iteration 1:<br>\
$\
\\begin{bmatrix} \
1&3 \\\\ \
6.5&7 \
\\end{bmatrix}\
$<br>\
Iteration 2:<br>\
$\
\\begin{bmatrix} \
\\frac{11}{3}&\\frac{13}{3} \\\\ \
8&9 \
\\end{bmatrix}\
$<br>\
Iteration 3:<br>\
$\
\\begin{bmatrix} \
\\frac{11}{3}&\\frac{13}{3} \\\\ \
8&9 \
\\end{bmatrix}\
$<br><br>\
<b>z:</b><br>\
Iteration 1:<br>\
$\
\\begin{bmatrix} \
1&3 \\\\ \
5.75&6.75 \
\\end{bmatrix}\
$<br>\
Iteration 2:<br>\
$\
\\begin{bmatrix} \
3&3.5 \\\\ \
6&\\frac{23}{3} \
\\end{bmatrix}\
$<br>\
Iteration 3:<br>\
$\
\\begin{bmatrix} \
3&3.5 \\\\ \
6&\\frac{23}{3} \
\\end{bmatrix}\
$<br>\
<br>\
<b>w:</b><br>\
Iteration 1:<br>\
$\
\\begin{bmatrix} \
1&2 \\\\ \
6&6.5 \
\\end{bmatrix}\
$<br>\
Iteration 2:<br>\
$\
\\begin{bmatrix} \
2&3 \\\\ \
7&\\frac{22}{3} \
\\end{bmatrix}\
$<br>\
Iteration 3:<br>\
$\
\\begin{bmatrix} \
2&3 \\\\ \
7&\\frac{22}{3} \
\\end{bmatrix}\
$<br>",
"source": "Textbook 8.3",
"comments": "\
Candidate must have knowledge of the k-means algorithm; \
they have to perform this algorithm a significant number of times which \
justifies the higher difficulty. \
"
},
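// The iteration performed in the workings (assign each point to its nearest
// centroid, then recompute each centroid as the mean of its cluster) is
// Lloyd's k-means algorithm. A minimal Python sketch (illustrative only; not
// part of the quiz format) reproducing the fixed point for dataset w:

```python
def kmeans(points, centroids):
    """Lloyd's algorithm with squared-Euclidean assignment;
    stops when the centroids no longer change."""
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # new centroid = coordinate-wise mean of the assigned points
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) for c in clusters]
        if new == centroids:
            return new
        centroids = new

# Dataset w with the stated initial centroids [0 1] and [4 4]
w = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 8)]
final = kmeans(w, [(0, 1), (4, 4)])
print(final)  # [(2.0, 3.0), (7.0, 7.33...)]
```

// The final centroids (2, 3) and (7, 22/3) match the clustering matrix in
// the question, confirming w as the correct answer.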
"7": {
"difficulty": "3",
"reference": "8.2",
"problem_type": "evaluation",
"answer_type": "blank_answer",
"question" : [ "Use k = 1 nearest neighbour with the labelled training data to test the testing data below.<br>\
Training Data <br>\
<table><tr><th>x</th><th>y</th><th>class</th></tr><tr><td>1</td><td>2</td><td>1</td></tr><tr><td>3</td><td>4</td><td>0</td></tr><tr><td>2</td><td>4</td><td>0</td></tr><tr><td>5</td><td>5</td><td>1</td></tr><tr><td>4</td><td>1</td><td>1</td></tr><tr><td>3</td><td>4</td><td>0</td></tr><tr><td>5</td><td>6</td><td>0</td></tr><tr><td>16</td><td>10</td><td>1</td></tr><tr><td>16</td><td>11</td><td>1</td></tr><tr><td>13</td><td>13</td><td>1</td></tr><tr><td>15</td><td>10</td><td>0</td></tr></table>\
Testing Data <br>\
<table><tr><th>x</th><th>y</th><th>class</th></tr><tr><td>10</td><td>10</td><td>0</td></tr><tr><td>5</td><td>3</td><td>0</td></tr><tr><td>3</td><td>2</td><td>1</td></tr><tr><td>4</td><td>4</td><td>0</td></tr><tr><td>9</td><td>8</td><td>1</td></tr><tr><td>16</td><td>15</td><td>1</td></tr></table>\
<br>Use two versions of k=1 nearest neighbour, each with a different distance metric: Euclidean and Manhattan.<br>\
<br> After classifying each of the testing points, record the results in a contingency matrix for \
each distance metric. <br>\
Which distance metric had the highest accuracy, Manhattan or Euclidean? ", 1,"<br><br>",
"And by how much? (Give answer as fraction) ", 2, " <br>"],
// these are the answers; "correctness" holds the index of
// the blank the answer fills
"answers": [
{ "correctness": 1,
"answer": "Manhattan",
"explanation": "Manhattan had highest accuracy, see workings"
},
{ "correctness": 2,
"answer": "1/6",
"explanation": "2/3 (Manhattan accuracy) - 1/2 (Euclidean accuracy) = 1/6"
}
],
"hint": "Euclidean distance: $D(a,b) = (\\sum_{i=1}^{n} (a_{i}-b_{i})^{2})^{1/2}$ <br>\
Manhattan Distance : $D(a,b) = \\sum_{i=1}^{n} |a_{i}-b_{i}|$ <br>\
With k=1 nearest neighbour, use the class of the nearest point in \
the model.\
",
"workings": "To calculate the class of each testing point I calculate the \
distance (using Euclidean and Manhattan) to each of the training points; \
as k = 1, I take the class of the point with the shortest distance. <br> \
The arrow -> points to the closest training item for the test point, <br> \
and (x) denotes its class.<br> \
Euclidean:<br>\
10 10 -> 13 13 (1) <br>\
5 3 -> 5 5 (1) <br>\
3 2 -> 4 1 (1) <br>\
4 4 -> 2 4 (0) <br>\
9 8 -> 5 6 (0) <br>\
16 15 -> 13 13(1) <br>\
<br>\
Compare the predicted class to the actual class and <br>\
generate the contingency table:<br>\
$$\\begin{vmatrix} \
&Pred + &Pred - & \\\\ \
Actual +& 2 &1 & 3 \\\\ \
Actual -& 2 &1 & 3 \\\\ \
& 4& 2 & 6\
\\end{vmatrix}$$\
Accuracy = (TP+TN)/total = $(2+1)/6 = 1/2$ \
<br>\
Manhattan:<br>\
10 10 -> 15 10 (0) <br>\
5 3 -> 5 5 (1) <br>\
3 2 -> 1 2 (1) <br>\
4 4 -> 2 4 (0) <br>\
9 8 -> 5 6 (0) <br>\
16 15 -> 16 11(1) <br>\
<br>\
Generate the contingency table:<br>\
$$\\begin{vmatrix} \
&Pred + &Pred - & \\\\ \
Actual +& 2 &1 & 3 \\\\ \
Actual -& 1 &2 & 3 \\\\ \
& 3& 3 & 6\
\\end{vmatrix}$$\
Accuracy = (TP+TN)/total = $(2+2)/6 = 2/3$ \
",
"source": "Textbook 8.2",
"comments": "Candidate has to apply the k-nearest neighbour algorithm to a set of training data. \
The candidate has to understand how the algorithm classifies points and be able to apply two different \
distance metrics. <br>\
The candidate then also has to understand how to compare the results of the two distance metrics \
by calculating the accuracy of each.\
"
},
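// The 1-NN evaluation above can be reproduced directly. A Python sketch
// (illustrative only; not part of the quiz format); ties are broken by
// training-list order, which matches the choices made in the workings:

```python
# Training and testing data from the question, as (point, class) pairs
train = [((1, 2), 1), ((3, 4), 0), ((2, 4), 0), ((5, 5), 1), ((4, 1), 1),
         ((3, 4), 0), ((5, 6), 0), ((16, 10), 1), ((16, 11), 1),
         ((13, 13), 1), ((15, 10), 0)]
test = [((10, 10), 0), ((5, 3), 0), ((3, 2), 1), ((4, 4), 0),
        ((9, 8), 1), ((16, 15), 1)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def accuracy(dist):
    correct = 0
    for point, actual in test:
        # k=1: predict the class of the nearest training point
        _, predicted = min(train, key=lambda t: dist(point, t[0]))
        correct += predicted == actual
    return correct / len(test)

print(accuracy(euclidean), accuracy(manhattan))  # 0.5 vs 2/3
```

// Manhattan wins by 2/3 - 1/2 = 1/6, matching the blank answers above.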
"8": {
"difficulty": "3",
"reference": "11.2",
"problem_type": "evaluation",
"question" : "In this question we will see how boosting can alter the results of a simple classifier. \
We will classify the training data below using a linear classifier, then apply the \
ensemble technique Boosting until the algorithm aborts. <br>\
<br>\
Draw a contingency table evaluating the training data for the linear classifier and for the boosted linear classifier. \
<br> \
Calculate the absolute difference between the two contingency tables below. <br> \
<b> Training Data </b><br> \
<table><tr><th>x</th><th>y</th><th>class</th></tr><tr><td>2</td><td>1</td><td>1</td></tr><tr><td>2</td><td>3</td><td>1</td></tr><tr><td>-1</td><td>-3</td><td>1</td></tr><tr><td>2</td><td>2</td><td>1</td></tr><tr><td>-1</td><td>1</td><td>0</td></tr><tr><td>-1</td><td>-3</td><td>0</td></tr><tr><td>-2</td><td>-3</td><td>0</td></tr><tr><td>-2</td><td>1</td><td>0</td></tr></table>\
",
// these are answers, a correct answer