Skip to content

Latest commit

 

History

History
155 lines (125 loc) · 3.64 KB

03 - SPMD programming model.md

File metadata and controls

155 lines (125 loc) · 3.64 KB

SPMD: Single program multiple data. High level programming languages to program parallelism. The compiled low level machine code may contain SIMD instructions.

ISPC: Intel SPMD Program Compiler

Example program with ISPC

C++ code calls ISPC code.

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

/* Initialize x here */

/* Execute ISPC code */
sinx(N, terms, x, result);

Syntax programCount is the number of simultaneously executing instances in the gang. Syntax programIndex is the current instance in the gang. Syntax uniform is a type modifier where each instance have "same" value for this variable (For optimization).

export void sinx(
	uniform int N,
	uniform int terms,
	uniform float* x,
	uniform float* result
) {
	/* Assumes N % programCount = 0 */
	for (uniform int i=0; i<N; i+= programCount) {
		int idx = i + programIndex;
		float value = x[idx];
		float number = x[idx] * x[idx] * x[idx];
		uniform int denom = 6; // 3!
		uniform int sign = - 1;
		
		for (uniform int j=1; j<=terms; j++) {
			value += sign * numer / denom;
			numer *= x[idx] * x[idx];
			denom *= (2*j+2) * (2*j+3);
			sign *= -1;
		}
		result[idx] = value;
	}
}

Even though the programmer can program as if there are programCount instances of the code running concurrently, the ISPC compiler generates .o binary with SIMD instructions.

Interleaved assignment of program instances

Blocked assignment of program instances

export void sinx(
	uniform int N,
	uniform int terms,
	uniform float* x,
	uniform float* result
) {
	/* Assumes N % programCount = 0 */
	uniform int count = N / programCount;
	int start = programIndex * count;
	for (uniform int i=0; i<count; i++) {
		int idx = i + start;
		float value = x[idx];
		float number = x[idx] * x[idx] * x[idx];
		uniform int denom = 6; // 3!
		uniform int sign = - 1;
		
		for (uniform int j=1; j<=terms; j++) {
			value += sign * numer / denom;
			numer *= x[idx] * x[idx];
			denom *= (2*j+2) * (2*j+3);
			sign *= -1;
		}
		result[idx] = value;
	}
}

Interleaved assignment is better: Parallel instructions are done on contiguous memory.

A higher level abstraction with foreach: Each iteration is executed exactly once for a gang of ISPC instances. ISPC determines which instance executes which loop iteration (Interleaved).

export void sinx(
	uniform int N,
	uniform int terms,
	uniform float* x,
	uniform float* result
) {
	foreach (i = 0 ... N) {
		float value = x[i];
		float number = x[i] * x[i] * x[i];
		uniform int denom = 6; // 3!
		uniform int sign = - 1;
		
		for (uniform int j=1; j<=terms; j++) {
			value += sign * numer / denom;
			numer *= x[i] * x[i];
			denom *= (2*j+2) * (2*j+3);
			sign *= -1;
		}
		result[i] = value;
	}
}

ISPC reduction: Each program instance sums into a per-instance variable named partial, then reduce_add adds all the partial floats into one uniform variable.

export uniform float sumall2(
	uniform int N,
	uniform float* x
) {
	uniform float sum;
	float partial = 0.0f;
	foreach (i = 0 ... N) {
		partial += x[i];
	}
	/* From ISPC math library */
	sum = reduce_add(partial);
	return sum;
}

The equivalent C code is shown below.

float sumall2(int N, float* x) {
	float tmp[8];
	__mm256 partial = _mm256_broadcast_ss(0.0f);
	
	for (int i=0; i<N; i+=8)
		partial += _mm256_add_ps(partial, _mm256_load_ps(&x[i]));
	
	_mm256_store_ps(tmp, partial);
	
	float sum = 0.0f;
	for (int i=0; i<8; i++)
		sum += tmp[i];
	
	return sum;
}