-
Notifications
You must be signed in to change notification settings - Fork 0
/
a1.html
339 lines (305 loc) · 13.6 KB
/
a1.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" lang="en">
<head>
<title>CS-240 Assignment 1: Sequential Map/Reduce</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
</script>
</head>
<body>
<h1>CS-240 Assignment 1: Sequential Map/Reduce</h1>
<div id="topnavbar">
<ul class="topnavlist">
<li><a href="index.html">CS-240 Home</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li><a href="assignments.html">Assignments</a></li>
<li><a href="announcements.html">Announcements</a></li>
<li><a href="https://piazza.com/kaust.edu.sa/fall2017/cs240/home">Q & A</a></li>
</ul>
</div>
<h2>Introduction</h2>
<p>
In the first two assignments, you will build a Map/Reduce library
as a way to learn the Go programming language and as a way to learn
about fault tolerance in distributed systems. For Assignment 1, you
will work with a sequential Map/Reduce implementation and write a
sample program that uses it.
The interface to the library is similar to the one described in
the original <a href="papers/map-reduce.pdf">MapReduce paper</a>.
</p>
<h2>Software</h2>
<p>
You'll implement this assignment (and all the assignments) in
<a href="http://www.golang.org/">Go</a>. The Go web site
contains lots of tutorial information which you may want to
look at.
</p>
<p>
For each assignment, we will provide you with a significant amount of scaffolding code to get started.
You can download a tarball containing the provided code for assignment 1 <a href="assignments/assignment1.tar.gz">here</a>.
We expect that it is likely to work on your own development environment that supports Go.
You can use the <a href="setup.html">instructions</a> provided to configure your own environment.
</p>
<p>
For the first two assignments, we supply you with parts of a flexible
MapReduce implementation. It has support for two
modes of operation, <em>sequential</em> and
<em>distributed</em>. Assignment 1 deals with the former. The map and reduce tasks
are all executed in serial: the first map task is
executed to completion, then the second, then the third, etc.
When all the map tasks have finished, the first reduce task is
run, then the second, etc. This mode, while not very fast, can
be very useful for debugging, since it removes much of the
noise seen in a parallel execution. The sequential mode also
simplifies or eliminates various corner cases of a distributed system.
</p>
<h3>Getting familiar with the source</h3>
<p>
The mapreduce package (located at <tt>$GOPATH/src/mapreduce</tt>) provides a simple Map/Reduce library with
a sequential implementation. Applications would normally call
<tt>Distributed()</tt> — located in <tt>mapreduce/master.go</tt> — to start a job, but may
instead call <tt>Sequential()</tt> — also in <tt>mapreduce/master.go</tt> — to get a
sequential execution, which will be our approach in this assignment.
</p>
<p>
The flow of the mapreduce implementation is as follows:
<ol>
<li>
The application provides a number of input files, a map
function, a reduce function, and the number of reduce
tasks (<tt>nReduce</tt>).
</li>
<li>
A master is created with this knowledge. It spins up an
RPC server (see <tt>mapreduce/master_rpc.go</tt>), and waits for
workers to register (using the RPC call
<tt>Register()</tt> defined in <tt>mapreduce/master.go</tt>).
As tasks become available, <tt>schedule()</tt> — located in <tt>mapreduce/schedule.go</tt> —
decides how to assign those tasks to workers, and how to
handle worker failures.
</li>
<li>
The master considers each input file one map task, and
makes a call to <tt>doMap()</tt>
in <tt>mapreduce/common_map.go</tt> at least once for each task. It
does so either directly (when using
<tt>Sequential()</tt>) or by issuing the <tt>DoTask</tt>
RPC — located in <tt>mapreduce/worker.go</tt> — on a worker. Each call to
<tt>doMap()</tt> reads the appropriate file, calls the
map function on that file's contents, and produces
<tt>nReduce</tt> files for each map file. Thus, after all map
tasks are done, the total number of files will be the product of the
number of files given to map (<tt>nIn</tt>) and <tt>nReduce</tt>.
<pre>f0-0, ..., f0-[nReduce-1],
...
f[nIn-1]-0, ..., f[nIn-1]-[nReduce-1].</pre>
</li>
<li>
The master next makes a call to <tt>doReduce()</tt>
in <tt>mapreduce/common_reduce.go</tt> at least once for each
reduce task. As with <tt>doMap()</tt>, it does so either
directly or through a worker. <tt>doReduce()</tt>
collects corresponding files from each map result
(e.g. <tt>f0-i, f1-i, ... f[nIn-1]-i</tt>), and runs the reduce function
on each collection. This process produces <tt>nReduce</tt> result
files.
</li>
<li>
The master calls <tt>mr.merge()</tt>
in <tt>mapreduce/master_splitmerge.go</tt>, which merges all the
<tt>nReduce</tt> files produced by the previous step
into a single output.
</li>
<li>
The master sends a Shutdown RPC to each of its workers,
and then shuts down its own RPC server.
</li>
</ol>
You should look through the files in the MapReduce implementation,
as reading them might be useful to understand how the other methods
fit into the overall architecture of the system hierarchy. However,
for this assignment, you will write/modify <strong>only</strong> <tt>doMap</tt>
in <tt>mapreduce/common_map.go</tt>, <tt>doReduce</tt> in <tt>mapreduce/common_reduce.go</tt>,
and <tt>mapF</tt> and <tt>reduceF</tt> in <tt>main/wc.go</tt>. You will not be able to submit
other files or modules.
</p>
<h2>Part I: Map/Reduce input and output</h2>
<p>
The Map/Reduce implementation you are given is missing some
pieces. Before you can write your first Map/Reduce function
pair, you will need to fix the sequential implementation. In
particular, the code we give you is missing two crucial
pieces: the function that divides up the output of a map task,
and the function that gathers all the inputs for a reduce task.
These tasks are carried out by the <tt>doMap()</tt> function in
<tt>mapreduce/common_map.go</tt>, and the <tt>doReduce()</tt> function in
<tt>mapreduce/common_reduce.go</tt> respectively. The comments in those
files should point you in the right direction.
</p>
<ul class="hints">
<li>
<b>Hint:</b> you might find Go's json package to be useful (<a href="https://blog.golang.org/json-and-go">https://blog.golang.org/json-and-go</a>)
</li>
</ul>
<p>
To help you determine if you have correctly implemented
<tt>doMap()</tt> and <tt>doReduce()</tt>, we have provided you
with a Go test suite that checks the correctness of your
implementation. These tests are implemented in the file
<tt>test_test.go</tt>. To run the tests for the sequential
implementation that you have now fixed, follow this (or a non-<tt>bash</tt> equivalent) sequence of shell commands,
starting from the cs240 top-level directory (that is, the top level of the hierarchy contained in
the tarball you downloaded above):
<pre>
$ cd 240 # or whatever you have called the directory that you unpacked from the tarball
$ export GOPATH="$PWD" # go needs $GOPATH to be set to the project's directory, above src.
$ cd "$GOPATH/src/mapreduce"
$ go test -run Sequential mapreduce/...
ok mapreduce 4.515s
</pre>
<p>
If the output did not show <em>ok</em> next to the tests, your
implementation has a bug in it. To give more verbose output,
set <tt>debugEnabled = true</tt> in <tt>mapreduce/common.go</tt>, and add
<tt>-v</tt> to the test command above. You will get much more
output along the lines of:
<pre>
$ go test -v -run Sequential
=== RUN TestSequentialSingle
master: Starting Map/Reduce task test
Merge: read mrtmp.test-res-0
master: Map/Reduce task completed
--- PASS: TestSequentialSingle (2.30s)
=== RUN TestSequentialMany
master: Starting Map/Reduce task test
Merge: read mrtmp.test-res-0
Merge: read mrtmp.test-res-1
Merge: read mrtmp.test-res-2
master: Map/Reduce task completed
--- PASS: TestSequentialMany (2.32s)
PASS
ok mapreduce4.635s
</pre>
<h2>Part II: Single-worker word count</h3>
<p>
Now that the map and reduce tasks are connected, we can start
implementing some interesting Map/Reduce operations. For this
assignment, we will be implementing word count — a simple and
classic Map/Reduce example. Specifically, your task is to
modify <tt>mapF</tt> and <tt>reduceF</tt> within <tt>main/wc.go</tt>
so that the application reports the number of occurrences of each word.
A word is any contiguous sequence of letters, as
determined by
<a href="http://golang.org/pkg/unicode/#IsLetter"><tt>unicode.IsLetter</tt></a>.
</p>
<p>
There are some input files with pathnames of the form <tt>pg-*.txt</tt> in
the <tt>main</tt> directory, downloaded from <a
href="https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads">Project
Gutenberg</a>.
This is the result when you initially try to compile the code we provide you
and run it:
<pre>
$ cd "$GOPATH/src/main"
$ go run wc.go master sequential pg-*.txt
# command-line-arguments
./wc.go:14: missing return at end of function
./wc.go:21: missing return at end of function</pre>
</p>
<p>
The compilation fails because we haven't written a complete map
function (<tt>mapF()</tt>) nor a complete reduce function
(<tt>reduceF()</tt>) in <tt>wc.go</tt> yet. Before you start
coding read Section 2 of the
<a href="http://research.google.com/archive/mapreduce-osdi04.pdf">MapReduce paper</a>.
Your <tt>mapF()</tt> and <tt>reduceF()</tt> functions will
differ a bit from those in the paper's Section 2.1. Your
<tt>mapF()</tt> will be passed the name of a file, as well as
that file's contents; it should split it into words, and return
a Go slice of key/value pairs, of type
<tt>mapreduce.KeyValue</tt>. Your <tt>reduceF()</tt> will be
called once for each key, with a slice of all the values
generated by <tt>mapF()</tt> for that key; it should return a
single output value.
</p>
<ul class="hints">
<li>
<b>Hint:</b> a good read on what strings are in Go is the
<a href="http://blog.golang.org/strings">Go Blog on strings</a>.
</li>
<li>
<b>Hint:</b> you can use
<a href="http://golang.org/pkg/strings/#FieldsFunc"><tt>strings.FieldsFunc</tt></a>
to split a string into components.
</li>
<li>
<b>Hint:</b> the strconv package
(<a href="http://golang.org/pkg/strconv/">http://golang.org/pkg/strconv/</a>)
is handy to convert strings to integers etc.
</li>
</ul>
<p>
You can test your solution using:
<pre>
$ cd "$GOPATH/src/main"
$ go run wc.go master sequential pg-*.txt
master: Starting Map/Reduce task wcseq
Merge: read mrtmp.wcseq-res-0
Merge: read mrtmp.wcseq-res-1
Merge: read mrtmp.wcseq-res-2
master: Map/Reduce task completed</pre>
The output will be in the file <tt>mrtmp.wcseq</tt>.
We will test your implementation's correctness with the following command,
which should produce the following top 10 words:
<pre>
$ sort -n -k2 mrtmp.wcseq | tail -10
he: 34077
was: 37044
that: 37495
I: 44502
in: 46092
a: 60558
to: 74357
of: 79727
and: 93990
the: 154024</pre>
(this sample result is also found in main/mr-testout.txt)
</p>
<p>
You can remove the output file and all intermediate files with:
<pre>$ rm mrtmp.*</pre>
</p>
<p>
To make testing easy for you, from the <tt>$GOPATH/src/main</tt> directory, run:
<pre>$ sh ./test-wc.sh</pre>
and it will report if your solution is correct or not.
</p>
<h2>Submission</h2>
<p>
Submit your code via blackboard <a href="https://blackboard.kaust.edu.sa">CS240 Assignment 1</a>. We expect your submission to only contain three files: <tt>common_map.go</tt>, <tt>common_reduce.go</tt>, and <tt>wc.go</tt>.
You may submit multiple times, only the one in blackboard at the time of grading will be recorded.
</p>
<p>
Before submitting, please run the full tests given above for both parts one final time. <b>You</b> are responsible for making sure your code works.
</p>
<p>
You will receive full credit for Part I if your software passes
the Sequential tests (as run by the <tt>go test</tt> commands above) on our servers.
You will receive full credit for Part II if your Map/Reduce word count
output matches the correct output for the sequential
execution above when run on our servers.
</p>
<p>
The final portion of your credit is determined by code quality tests, using the standard tools <tt>gofmt</tt> and <tt>go vet</tt>.
You will receive full credit for this portion if all files submitted conform to the style standards set by <tt>gofmt</tt> and the report from <tt>go vet</tt> is clean for your mapreduce package (that is, produces no errors).
If your code does not pass the <tt>gofmt</tt> test, you should reformat your code using the tool. You can also use the <a href="https://github.com/qiniu/checkstyle">Go Checkstyle</a> tool for advice to improve your code's style, if applicable. Additionally, though not part of the graded cheks, it would also be advisable to produce code that complies with <a href="https://github.com/golang/lint">Golint</a> where possible.
</p>
<h2>Acknowledgements</h2>
<p>This assignment is adapted from Princeton's COS-418, and previously from MIT's 6.824 course. Thanks to Mike Freedman, Frans Kaashoek, Robert Morris, and Nickolai Zeldovich for their support.</p>
<hr />
<p class="lastupdated">Last updated: 2017-08-18 11:24:15</p>
</body>
</html>