My master's thesis on algorithm engineering in the context of data-oblivious sorting.
It covers a wide range of algorithms, some classic sorting networks, and some newer data-oblivious random sorters.
Looking at running time, instruction count, cache misses, and branch mispredictions, algorithms are augmented using CUDA, OpenMP, and SSE, I attempt to make some interesting observations on the state of data-oblivious sorting.
Also, a small tool for flattening sorting networks is included
See Tex/thesis.pdf for the actual thesis.