Skip to content

Latest commit

 

History

History
555 lines (408 loc) · 20.1 KB

readme.md

File metadata and controls

555 lines (408 loc) · 20.1 KB

Overview

This is due at:

22:00, 14th Nov 2016

This exercise looks at less regular types of computation, and looks at situations where you might have to re-organise the code in order to make it faster.

This is another new coursework for this year, and is designed to act as a bit of a bridge between the highly structured CW1-CW3, and the more open-ended CW4 and CW5. It replaces the amazing makefile/pipeline coursework, which was great (I think), except that it was too time-consuming (plus lots of students had difficulty implementing FIRs). I've run through it a couple of times, and adjusted it down for time (made the GPU part at the end optional). Let me know where there are problems in the spec.

The overall application domain is a very simple multi-layer perceptron network. We only look at the feed-forward part, i.e. classification part. At each layer a vector of data is presented, the layer transforms it into a new vector of data. The output of one layer goes into the input of the next layer, and so on.

Exploring the application

The purposes of this course is to accelerate and optimise things even if you don't fully understand them, so don't worry if you've never looked at neural networks. It is better to look at the code. So first build all the tools:

make tools

This should leave you with a number of programs in the bin directory:

- `bin/run_network` : This takes a list of layers (encoded
    as binary files) then runs data through them.

- `bin/generate_sparse_layer` : This creates an instance of
    a layer with a given input and output size, and writes
    it to stdout as a binary file.
    
- `bin/print_layer_as_text` : Takes a binary layer description
    and prints it as human readable text.
    
- `bin/print_network_as_dot` : Takes a list of layers (binary
   descriptions) and prints them in the [GraphViz/dot](http://www.graphviz.org/) format.

In order to get a feel for this, create a layer with 8 inputs and 12 outputs:

mkdir -p working
bin/generate_sparse_layer 8 12 > w/layer_8_12.bin

This should have generated a file called w/layer_8_12.bin, which you can view as text:

bin/print_layer_as_text < w/layer_8_12.bin

You should see some information about the nework, and in particular the list of synapses, or weights. There will be many lines looking like this:

   98 : 10 <- 6 (0.660889)

this gives the connectivity of the graph. In this case it is saying that synapse 98 takes input 6 and transfers it to output 10 with weight 0.660889. This particular layer is sparse, which means that even though there are 8*12=96 possible paths, only 9 of them exist, so the layer has a sparsity factor of 0.09375. Sparsity is often useful for reducing compute load, but the resulting irregularity makes it more difficult to optimise.

Optional: We can also view the network as a graph if GraphViz is installed. First generate a dot file from the graph:

bin/print_network_as_dot w/layer_8_12.bin > w/layer_8_12.dot

then render it as a png:

dot w/layer_8_12.dot -Tpng > w/layer_8_12.png

This results in the following image on my machine:

8x12 Layer

In this instance it is so sparse that some neurons aren't connected, but others have multiple connections.

A layer is applied to data by turning it into an instance of the Layer class, found in include/layer.hpp. Each instance of Layer contains the set of connections for one layer, and is able to execute that layer through Layer::execute. The reference version of Layer::execute is given by the class SimpleLayer, which is in src/layers/simple_layer.hpp.

Looking in SimpleLayer::execute, the key code is:

void execute(
    const int8_t *pIn,  // Values of input neurons in -127..+127
    int8_t *pOut        // Values of output neurons in -127..+127
) const
{        
    // Create a working vector of nOut values
    std::vector<int32_t> acc(m_nOut, 0); 
    
    // Loop over all synapses (edges)
    for(unsigned i=0; i<m_synapses.size(); i++){
        int32_t contrib = m_synapses[i].weight * pIn[ m_synapses[i].src];
        acc[ m_synapses[i].dst ] += contrib >> (23-16);
    }
    
    // Transform all outputs
    for(unsigned i=0; i<m_nOut; i++){
        pOut[i] = sigmoid( acc[i] ); // compress with sigmoid function
    }
}

There are two loops in here:

  • Iterate over the synapses to transfer contributions from the input layer to the output layer.

  • Compress the accumulated input values to a value between 0 and 1 using a sigmoid function, in this case sigmoid(x) = 1/(1+exp(-x)).

This example uses fixed-point rather than floating-point, as it makes it easier to check whether we are getting the right results. If we used floating-point then the order of execution matters, but with fixed-point it should always be the same.

To run the application, generate 1024 bytes of random input data (updated as suggested by @ awai54st) :

cat /dev/urandom | head -c 1024 > w/random1024.bin

You can then run the filter as a single layer network:

cat w/random1024.bin | bin/run_network w/layer_8_12.bin > w/ref1024.bin

You'll have ended up with a 1536-byte output file, because the output layer is larger than the input size. Internally bin/run_network will cut the data up into batches that match the size of the data.

Parallelising with Pipes

The idea is to have multiple layers, not just one layer, and this presents a performance bottleneck. Generate 8 layers, each of size 16x16:

for i in $(seq -w 0 16); do
    mkdir -p w
    bin/generate_sparse_layer 16 16 > w/n16_${i}.bin;
done

You should have ended up with 16 files, which can then be used as a network:

cat w/random1024.bin | bin/run_network \
    w/n16_00.bin w/n16_01.bin w/n16_02.bin w/n16_03.bin \
    w/n16_04.bin w/n16_05.bin w/n16_06.bin w/n16_07.bin \
    w/n16_08.bin w/n16_09.bin w/n16_10.bin w/n16_11.bin \
    w/n16_12.bin w/n16_13.bin w/n16_14.bin w/n16_15.bin \
    > w/ref_n16_1024.bin

This will have run the same data through a 16 layer network, rather than just 1.

Optional: We can generate a dot graph of this network:

bin/print_network_as_dot \
    w/n16_00.bin w/n16_01.bin w/n16_02.bin w/n16_03.bin \
    w/n16_04.bin w/n16_05.bin w/n16_06.bin w/n16_07.bin \
    w/n16_08.bin w/n16_09.bin w/n16_10.bin w/n16_11.bin \
    w/n16_12.bin w/n16_13.bin w/n16_14.bin w/n16_15.bin \
    > w/n16.dot

then render it to a png:

dot w/n16.dot -Tpng > w/n16.png

I get this mess:

n16

The overall computational goal is to make the networks deeper and wider, so in general we want more layers, and each layer should be bigger.

Task: Create a script called scripts/create_n512.sh with the following contents:

#!/bin/bash
for i in $(seq -w 0 32); do
    mkdir -p w
    bin/generate_sparse_layer 512 512 > w/n512_${i}.bin;
done

If you run this script it should generate 32 layers, each of which are 512 wide.

Task: Create a script called scripts/run_n512_P1.sh with the following contents:

#!/bin/bash
bin/run_network \
    w/n512_00.bin w/n512_01.bin w/n512_02.bin w/n512_03.bin \
    w/n512_04.bin w/n512_05.bin w/n512_06.bin w/n512_07.bin \
    w/n512_08.bin w/n512_09.bin w/n512_10.bin w/n512_11.bin \
    w/n512_12.bin w/n512_13.bin w/n512_14.bin w/n512_15.bin \
    w/n512_16.bin w/n512_17.bin w/n512_18.bin w/n512_19.bin \
    w/n512_20.bin w/n512_21.bin w/n512_22.bin w/n512_23.bin \
    w/n512_24.bin w/n512_25.bin w/n512_26.bin w/n512_27.bin \
    w/n512_28.bin w/n512_29.bin w/n512_30.bin w/n512_31.bin

Note that this script provides no input and no output for bin/run_network, so it will come from the input and output of the script. To run it, generate a bigger chunk of input data:

cat /dev/urandom | head -c 1048576 > w/random1M.bin

and then pipe it through:

cat w/random1M.bin | scripts/run_n512_P1.sh > /dev/null

Depending on your system, this might take a while (up to minutes). To get a sense of what is going on, it can be useful to use the tool pv, which displays progress of data through the stream:

cat w/random1M.bin | scripts/run_n512_P1.sh | pv > /dev/null

This shows you the bandwidth of the data flowing out of the pipeline, as well as total data that has passed through. We could alternatively have placed pv at the beginning:

cat w/random1M.bin | pv | scripts/run_n512_P1.sh > /dev/null

which for the P1 version will make little difference.

Task: Create a script called scripts/run_n512_P2.sh with the following contents:

#!/bin/bash
bin/run_network \
    w/n512_00.bin w/n512_01.bin w/n512_02.bin w/n512_03.bin \
    w/n512_04.bin w/n512_05.bin w/n512_06.bin w/n512_07.bin \
    w/n512_08.bin w/n512_09.bin w/n512_10.bin w/n512_11.bin \
    w/n512_12.bin w/n512_13.bin w/n512_14.bin w/n512_15.bin \
| bin/run_network \
    w/n512_16.bin w/n512_17.bin w/n512_18.bin w/n512_19.bin \
    w/n512_20.bin w/n512_21.bin w/n512_22.bin w/n512_23.bin \
    w/n512_24.bin w/n512_25.bin w/n512_26.bin w/n512_27.bin \
    w/n512_28.bin w/n512_29.bin w/n512_30.bin w/n512_31.bin

Be careful about the line endings - each of the \ characters at the end of the line means that the command continues, and the pipe | character is joining them together.

Run your new pipeline, and see if the performance is any different:

cat w/random1M.bin | scripts/run_n512_P2.sh | pv > /dev/null

You can also look at the result with time to see what is going on, and also may want to look at your task manager (e.g. top or htop).

Task: Create a script called scripts/run_n512_P4.sh, which splits the network into four stages.

Task: Create a script called scripts/run_n512_P8.sh, which splits the network into eight stages.

Task: Create a script called scripts/run_n512_P16.sh, which splits the network into sixteen stages.

Task: Create a script called scripts/run_n512_P32.sh, which splits the network into 32 stages.

Hopefully you'll see different results, and depending on your number of cores, you should see pretty decent speed-ups. Note that it depends on the amount of data flowing through - if there is not enough data, then eventually the speed-up will reduce.

Task: Generate a plot of P (i.e. which script was used) versus sustained bandwidth (i.e. bytes/sec) for these scripts on a c4.8xlarge AWS instance. Note that you may want to leave this till all of your development work is done, so that you don't keep starting and stopping instances. Save the plot as results/pipeline_p_vs_bandwidth.pdf.

The takeaway lesson from this should be that pipelines can be quite effective at doing load-balancing and parallelising sequential components. It is often easier to split a task up into parallel stages, rather than trying to parallelise an individual stage. It's also possible to put things like decompressors and compressors on the front and end of the pipeline, reducing the size of the data on disk.

Parallelising with TBB

While pipeline parallelism in the shell can work well, there is a lot of overhead due to moving data between processes, so it is still worthwhile exploring how to make an irregular application faster using our current methods. At the moment we've only looked at simple dense iteration spaces, so some extra thought will be needed for this application.

Look at the existing implementation of SimpleLayer::execute, and consider the two loops:

  • Is there any data-sharing between iterations?

  • Do any iterations write to the same locations?

  • Can they be parallelised?

(This is another "in your head" thing, you don't need to write it down).

Task: create a new layer engine called par_for_naive which parallelises both loops without regard to correctness:

  • copy the implementation src/layers/simple_layer.cpp into a new file called src/layers/par_for_naive_layer.cpp.

  • Change the class name from SimpleLayer to ParForNaiveLayer

  • Adjust the factory function at the bottom of the file (i.e. CreateSimpleLayer -> CreateParForNaiveLayer.

  • Go into src/layers/layer_factory.cpp, and make sure the new engine is created if the engine type is "par_for_naive".

  • Parallelise the two loops.

Once you have got this setup, you should be able to do:

cat w/random1024.bin | \
   bin/run_network w/n512_00.bin:par_for_naive \
   > w/random1024_par_for_naive.out

in order to select the new engine (note the :par_for_naive suffix on the layer), or:

cat w/random1024.bin | \
   bin/run_network w/n512_00.bin:simple \
   > w/random1024_simple.out

to select the original (default) engine.

The diff command can be used to tell if two files are different. Use it to convince yourself that w/random1024_par_for_naive.out and w/random1024_simple.out are different (or at least that sometimes they are different).

Parallelising Using Atomics

There is a clear problem in the first for loop, where multiple tasks are writing to the acc array at the same time. One solution is to make the accesses atomic. Fortunately we are dealing with integer data, rather than floating-point, so you can just replace:

std::vector<int32_t> acc(m_nOut, 0);

with:

std::vector<tbb::atomic<int32_t> > acc(m_nOut, 0);

Task: create a new implementation called ParForAtomic with engine name par_for_atomic, and make the acc array atomic.

We'll benchmark them all at the end, but it is worth looking at how different the speed is, even informally. On my four-core machine I found the atomic version was about half the speed of the naive version.

The cost of the atomics can be attributed to two elements:

  • An atomic add has a fixed penalty over a non-atomic add, even if there are no conflicting writes

  • There is a variable penalty which rises as parallel tasks tend to "fight" over access to the cache line holding particular values. This tends to get worse with more cores.

Rewriting to Expose Parallelism

The best way of parallelising irregular application is to transform them into a more regular form of safe parallelism. The algorithm as written combines two kinds of parallelism:

  • edge oriented parallelism, which gives an iteration space over the edge array. This is problematic because each output neuron maps onto multiple edges, so the iterations are not independent.

  • output oriented parallelism, where we loop over the acc array and map each element onto a single element of pOut.

If we could restructure the algorithm so that both loops mapped onto the output iteration space, then we could dispense with the locks.

Task: sketch the data-dependencies found in the execute function, including reads, writes, and operations. Save the graph as results/dependency_sketch.pdf.

You can use a very small layer with only a handful of nodes. I don't care how you do this, or what form it is in. Hand-drawn and photographed is fine as long as it is readable, and the file isn't too big (e.g. not more than a few hundred KB), and it is a pdf. I mainly want some evidence that you have thought about it, and it may give us something to talk about in the oral.

Task: draw a sketch of the "cone" of dependencies associated with one point in the output iteration space, including only those parts needed to calculate that output. So if you considere one output neuron as the "tip" of the cone, then the rest of the cone sweeps backwards to the relevant input neurons. Save the graph as results/output_dependency_cone.pdf.

For both these sketches you may be confused as to what exactly I'm asking for, but I'm not asking for anything specific. I certainly have an idea about what I would draw, but there are many other valid possibilities. All I want is something that captures something useful about the problem and how to solve it.

Hopefully the process of thinking about it makes it clear how to transform the problem such that you can parallelise over the output space.

Task: create a layer engine called ClusteredLayer which uses a single for loop over the outputs. Note that there will be another nested loop within the outer for loop, but the execute function will only directly contain one loop.

Hint: you probably want to pre-calculate information derived from the synapse vector, rather than using the synapse vector directly. What other data-structures could you build in the constructor?

Task: create another engine called ParForClusteredLayer which adds the parallel_for loop.

The reason for doing the sequential version then parallelising it is to make sure the logic is right. Try not to skip ahead until you're reasonably sure the sequential one works.

Evaluating performance

Use an AWS c4.8xlarge instance to evaluate the performance of the different implementations:

  • simple

  • par_for_naive

  • par_for_atomic

  • clustered

  • par_for_clustered

Task: produce a plot exploring execution time (y) against input ratio (input-layer-size / output-layer-size) for a single layer, called results/single_layer_ratio_vs_time.pdf.

By ratio I do mean nIn/nOut - the reason is that it changes the amount of work per output item, and also the amount of atomic contention.

Task: produce a plot which uses a single layer with nIn==nOut, and explores scaling of n (x-axis) versus type (y-axise), called results/single_layer_n_vs_time.pdf.

Submission

Your final github push before 22:00 on 14th of Nov is considered to be your submitted version. I'll be pulling them from there directly.

If you wish to submit something to blackboard, then feel free to commit the text hash of the relevant commit.

Optional: Mapping the clustered version to a GPU

Note: this is an optional suggested extra, as it makes sure that you have a very good handle on the OpenCL APIs. It took it out as a required element, as some people (not students) thought it was a bit complicated or time-consuming for this exercise.

Once you have the clustered version, you have an algorithm which works over the outputs with a single parallel_for loop. This means it has also transformed it into a form which will work in a GPU. However, you will need to think carefully about data-movement:

  • Any data-structures about the graph will need to be built and copied to the GPU in the constructor, otherwise there will be significant overead in each call to execute.

  • You can't use pointers-to-pointers or classes in the kernel, so anything like vectors of vectors or pointers to vectors will need to be turned into linear sequences of bytes.

Apart from that, the process is straightforward, though it needs a lot of boiler-plate OpenCL code on the software side.

If you do this, don't expect an amazing speed-up. If you look at the way that it accesses memory then it doesn't map to a GPU very well. Another problem is that different work-items take different amounts of time...

Optional: Heterogeneous pipelines

The neuron pipeline is designed to allow different engines to be used together, so it is possible to mix and match TBB and OpenCL engines. Depending on the characteristics of each layer, it may be worth moving them to or from hardware...