\documentclass[10pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{mdframed}
\usepackage{pgfplots}
\usepackage{amsmath}
\usepackage{fancyhdr}
\usepackage{environ}
\usepackage{amsfonts}
\usepackage{siunitx}
\usepackage{lastpage}
\usepackage{titling}
\usepackage{amssymb}
\usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry}
\pgfplotsset{every tick label/.append style={font=\small}}
\author{mkem114 - 6273632}
\title{CS304 - Assignment 3 Report}
\pagestyle{fancy}
\renewcommand{\footrulewidth}{0.4pt}
\cfoot{Page \thepage/\pageref{LastPage}}
\chead{\thetitle}
\lfoot{\theauthor}
\rfoot{\today}
\begin{document}
\maketitle
\section{Cache measurement}
\subsection{Processor name}
\begin{verbatim}
Intel Core i7 7500U
\end{verbatim}
\subsection{Cache sizes}
There is no L4 cache in this processor, and L1 is split into separate data and instruction caches. L3 is the only shared cache; each physical core has its own L2 and its own pair of L1 caches. For the size of each cache refer to Figure~\ref{cachesize}.
\begin{figure}[h!]
\begin{mdframed}
\begin{verbatim}
LEVEL1_ICACHE_SIZE 32768
LEVEL1_DCACHE_SIZE 32768
LEVEL2_CACHE_SIZE 262144
LEVEL3_CACHE_SIZE 4194304
\end{verbatim}
\end{mdframed}
\caption{Extract from ``getconf -a | grep CACHE\_SIZE''}
\label{cachesize}
\end{figure}
\subsection{Cache measurement table}
\begin{figure}[h!]
\begin{center}
\begin{tabular}{|r|r|r|r|}
\hline
$N$ & size of $a$ / bytes & \multicolumn{2}{c|}{time per iteration / ns} \\ \hline
 & & linear access & random access \\ \hline
\num[group-separator={,}]{ 2048 } & \num[group-separator={,}]{ 8192 } & 2.4 & 2.38 \\ \hline
\num[group-separator={,}]{ 4096 } & \num[group-separator={,}]{ 16384 } & 2.38 & 2.51 \\ \hline
\num[group-separator={,}]{ 8192 } & \num[group-separator={,}]{ 32768 } & 2.32 & 2.59 \\ \hline
\num[group-separator={,}]{ 16384 } & \num[group-separator={,}]{ 65536 } & 2.33 & 2.96 \\ \hline
\num[group-separator={,}]{ 32768 } & \num[group-separator={,}]{ 131072 } & 2.33 & 3.94 \\ \hline
\num[group-separator={,}]{ 65536 } & \num[group-separator={,}]{ 262144 } & 2.33 & 4.52 \\ \hline
\num[group-separator={,}]{ 131072 } & \num[group-separator={,}]{ 524288 } & 2.57 & 6.19 \\ \hline
\num[group-separator={,}]{ 262144 } & \num[group-separator={,}]{ 1048576 } & 2.76 & 16.61 \\ \hline
\num[group-separator={,}]{ 524288 } & \num[group-separator={,}]{ 2097152 } & 2.87 & 33.65 \\ \hline
\num[group-separator={,}]{ 1048576 } & \num[group-separator={,}]{ 4194304 } & 2.95 & 37.70 \\ \hline
\num[group-separator={,}]{ 2097152 } & \num[group-separator={,}]{ 8388608 } & 2.98 & 41.83 \\ \hline
\num[group-separator={,}]{ 4194304 } & \num[group-separator={,}]{ 16777216 } & 3.00 & 43.49 \\ \hline
\num[group-separator={,}]{ 8388608 } & \num[group-separator={,}]{ 33554432 } & 3.01 & 45.04 \\ \hline
\num[group-separator={,}]{ 16777216 } & \num[group-separator={,}]{ 67108864 } & 3.01 & 45.28 \\ \hline
\end{tabular}
\end{center}
\caption{Results of running cachetest1 and cachetest2}
\label{cacheresults}
\end{figure}
\subsection{Cache measurement chart}
See Figure~\ref{cacheresultsone} and Figure~\ref{cacheresultstwo}.
\begin{figure}[h!]
\begin{center}
\begin{tikzpicture}[scale=1]
\begin{axis}[
ylabel = time per iteration in ns,
xlabel = size of $a$ in bytes,
xtick=data,
xticklabels={8K,,32K,,128K,,0.5M,,2M,,8M,,32M},
xmode=log
]
\addplot table [x=a, y=first, col sep=comma] {CS304A3.csv};
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Time per iteration in nanoseconds over size of $a$ in bytes for cachetest1}
\label{cacheresultsone}
\end{figure}
\begin{figure}[h!]
\begin{center}
\begin{tikzpicture}[scale=1]
\begin{axis}[
ylabel = time per iteration in ns,
xlabel = size of $a$ in bytes,
xtick=data,
xticklabels={8K,,32K,,128K,,0.5M,,2M,,8M,,32M},
xmode=log
]
\addplot table [x=a, y=second, col sep=comma] {CS304A3.csv};
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Time per iteration in nanoseconds over size of $a$ in bytes for cachetest2}
\label{cacheresultstwo}
\end{figure}
\subsection{Differences and similarities}
\begin{itemize}
\item From 8K to 256K the curves in Figure~\ref{cacheresultsone} and Figure~\ref{cacheresultstwo} are almost the same, apart from a small difference at the start, which will be because random access loads the data onto L2 better than linear access does. Normally everything is kept in L1 alone if it fits, which is what happens with sequential access. After the small dip there is very little rise in the curve, because this range starts with L1 just full and ends as L2 approaches being full.
\item Random and sequential access have the same trend shape from 512K to 8M, which again comes down to the fixed cache sizes of the processor. At 512K (the first step after the L2 cache is exhausted) the L2 cache is always full, which is why the time per iteration starts to rise quite rapidly. The rise slows significantly from 2M to 8M, which is when the L3 cache fills up and runs out of space; the time penalty starts to stabilise because the demand on L3 stabilises as it approaches 100\% use. The curve is so steep because of the performance difference between L2 and L3.
\item Both random and sequential access again have the same trend shape from 16M onwards. At 16M the array is four times the size of the L3 cache, and the time per iteration stops rising because almost every access now goes to main memory. The time penalty for a memory access is constant, so as the percentage of cache misses stabilises the average time per iteration stabilises too.
\item Overall, except when both are run on an 8K array, cachetest2 is much slower because the array $a$ has to be indexed randomly. Caching $a$ gains little once it is too big for L3: most accesses have to be served from memory, hence the large relative time penalty. (A minimal sketch of the two access patterns follows this list.)
\end{itemize}
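The cachetest1 and cachetest2 sources are not reproduced in this report; the following is only a minimal sketch of the two access patterns being compared, where the function names, the \texttt{int} element type and the pre-shuffled index array are all assumptions made for illustration.
\begin{verbatim}
#include <stddef.h>

/* Linear access: consecutive elements, so each cache line loaded
 * from a is fully used before the next one is fetched. */
long sum_linear(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Random access: idx is assumed to hold a pre-shuffled permutation
 * of 0..n-1, so there is little spatial locality once a no longer
 * fits in the cache. */
long sum_random(const int *a, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]];
    return sum;
}
\end{verbatim}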
\section{Matrix product}
As per the assignment specification the matrices were all of size $1000\times1000$, and for better accuracy each measurement was run for 5 repetitions. The times given below have already been divided by 5.
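A minimal sketch of how such an averaged timing can be taken is shown below; the \texttt{CLOCK\_MONOTONIC} clock and the \texttt{mm} function pointer are illustrative assumptions, not taken from the actual harness.
\begin{verbatim}
#include <time.h>

#define REPS 5

/* Assumed timing scheme: run the product REPS times back to back and
 * report the average, which is why the quoted times are already
 * divided by 5.  mm() stands in for one full 1000 x 1000 product. */
double time_product(void (*mm)(void))
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        mm();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) * 1e-9) / REPS;
}
\end{verbatim}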
\subsection{Straightforward implementation}
Took \SI{3.26}{\second}.\\
This implementation has to step through one of the two matrices column by column, because a whole column of it is needed to compute each element of the result matrix. This is slow because the speed advantages of the cache are not fully used: 2D matrices in C are stored row by row, so changing the row index more often than the column index means consecutive accesses are $N$ elements apart, and the sequential elements loaded into each cache line go largely unused.\\
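A minimal sketch of this straightforward triple loop is shown below; the statically sized arrays of doubles are an assumption made for illustration, not a copy of the submitted code.
\begin{verbatim}
#define N 1000

/* c = a * b with the straightforward triple loop.  The inner loop
 * walks b down a column: consecutive accesses b[k][j] are N doubles
 * apart, so each one touches a different cache line. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];   /* column-wise walk through b */
            c[i][j] = sum;
        }
}
\end{verbatim}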
\subsection{Temporary matrix}
Took \SI{1.54}{\second}.
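The temporary matrix itself is not described further here; a common variant, assumed below purely for illustration, is to transpose the second operand into a temporary so that both matrices are then read row by row.
\begin{verbatim}
#define N 1000

/* Assumed variant: transpose b into a temporary bt once, so both a
 * and bt are read with unit stride in the innermost loop. */
void matmul_tmp(double a[N][N], double b[N][N], double c[N][N])
{
    static double bt[N][N];                 /* the temporary matrix */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            bt[j][i] = b[i][j];             /* one column-wise pass instead of N */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * bt[j][k];  /* both operands sequential */
            c[i][j] = sum;
        }
}
\end{verbatim}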
\subsection{Blocking and temporary matrix}
Took \SI{1.07}{\second}.\\
A block size of $k=16$ was found to be the fastest for me.\\
Instead of taking the element-wise product of a row of one matrix and a column of the other for every element of the result, the blocked version takes a $k\times k$ block of one matrix, multiplies it against a $k\times N$ stripe of the other matrix, and combines the partial sums into the result matrix; this gives a performance boost.\\
This boost arises because the block, at $k\times k\times \text{size} = 16\times 16\times 8 = 2048$ bytes (each element takes $8$ bytes), fits inside the L1 cache and can stay there until it is no longer needed. The $k\times N$ stripe of the other matrix being multiplied, at $k\times N\times \text{size} = 16\times 1000\times 8 = \num[group-separator={,}]{128000}$ bytes, fits inside the L2 cache until it is no longer needed, so no element of either matrix has to be loaded from memory more than once.
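A rough sketch of this blocking is given below. It is an assumed reading of the scheme described above rather than the submitted code: it works on a $k\times k$ block of $a$ against the matching $k\times N$ stripe of $b$, the temporary matrix is left out for brevity, and the extra bounds checks handle the remainder rows since 1000 is not a multiple of 16.
\begin{verbatim}
#define N 1000
#define K 16                 /* block size found fastest above */

/* c += a * b, processed one K x K block of a at a time against the
 * matching K x N stripe of b.  The block (16*16*8 = 2048 bytes) can
 * stay in L1 and the stripe (16*1000*8 = 128000 bytes) in L2 while
 * they are in use.  c must be zeroed beforehand. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    for (int ii = 0; ii < N; ii += K)
        for (int kk = 0; kk < N; kk += K)
            for (int i = ii; i < ii + K && i < N; i++)
                for (int k = kk; k < kk + K && k < N; k++) {
                    double aik = a[i][k];        /* element of the K x K block */
                    for (int j = 0; j < N; j++)  /* row-wise over the stripe of b */
                        c[i][j] += aik * b[k][j];
                }
}
\end{verbatim}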
\end{document}