Till now we have not paid much attention to speed and possible optimization possibilities inherent in the various compilers. We have compiled and linked as
c++ -c mycode.cpp
c++ -o mycode.exe mycode.o
For Fortran replace with for example gfortran or ifort. This is what we call a flat compiler option and should be used when we develop the code. It produces normally a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output because every operation is done precisely as the user specified it.
It is instructive to look up the compiler manual for further instructions by writing
man c++
We have additional compiler options for optimization. These may include procedure inlining where performance may be improved, moving constants inside loops outside the loop, identify potential parallelism, include automatic vectorization or replace a division with a reciprocal and a multiplication if this speeds up the code.
c++ -O3 -c mycode.cpp
c++ -O3 -o mycode.exe mycode.o
This (other options are -O2 or -Ofast) is the recommended option.
It is also useful to profile your program under the development stage. You would then compile with
c++ -pg -O3 -c mycode.cpp
c++ -pg -O3 -o mycode.exe mycode.o
After you have run the code you can obtain the profiling information via
gprof mycode.exe > ProfileOutput
When you have profiled properly your code, you must take out this option as it slows down performance. For memory tests use valgrind. An excellent environment for all these aspects, and much more, is Qt creator.
Adding debugging options is a very useful alternative under the development stage of a program. You would then compile with
c++ -g -O0 -c mycode.cpp
c++ -g -O0 -o mycode.exe mycode.o
This option generates debugging information allowing you to trace for example if an array is properly allocated. Some compilers work best with the no optimization option -O0.
Depending on the compiler, one can add flags which generate code that catches integer overflow errors. The flag -ftrapv does this for the CLANG compiler on OS X operating systems.
In general, irrespective of compiler options, it is useful to
k = n-1;
for (i = 0; i < n; i++){
a[i] = b[i] +c*d;
e = g[k];
}
A better code is
temp = c*d;
for (i = 0; i < n; i++){
a[i] = b[i] + temp;
}
e = g[n-1];
Here we avoid a repeated multiplication inside a loop. Most compilers, depending on compiler flags, identify and optimize such bottlenecks on their own, without requiring any particular action by the programmer. However, it is always useful to single out and avoid code examples like the first one discussed here.
Present CPUs are highly parallel processors with varying levels of parallelism. The typical situation can be described via the following three statements.
One way of categorizing modern parallel computers is to look at the memory configuration.
Vectorization is the process of converting an algorithm that performs scalar operations (typically one operation at the time) to vector operations where a single operation can refer to many simultaneous operations. Consider the following example
for (i = 0; i < n; i++){
a[i] = b[i] + c[i];
}
If the code is not vectorized, the compiler will simply start with the first element and then perform subsequent additions operating on one address in memory at the time.
The number of elements which can be operated on in parallel range from four single-precision floating point data elements in so-called Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector-length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
for (i = 0; i < n; i++){
a[i] = b[i] + c[i];
}
If the code is not vectorized and we have a 128-bit register to store a 32 bits floating point number, it means that we have \( 3\times 32 \) bits that are not used. For the first element we have
0 | 1 | 2 | 3 |
---|---|---|---|
a[0]= | not used | not used | not used |
b[0]+ | not used | not used | not used |
c[0] | not used | not used | not used |
We have thus unused space in our SIMD registers. These registers could hold three additional integers.
for (i = 0; i < n; i+=4){
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
a[i+3] = b[i+3] + c[i+3];
}
displayed here as
0 | 1 | 2 | 3 |
---|---|---|---|
a[0]= | a[1]= | a[2]= | a[3]= |
b[0]+ | b[1]+ | b[2]+ | b[3]+ |
c[0] | c[1] | c[2] | c[3] |
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
int i = atoi(argv[1]);
double *a, *b, *c;
a = new double[i];
b = new double[i];
c = new double[i];
for (int j = 0; j < i; j++) {
a[j] = 0.0;
b[j] = cos(j*1.0);
c[j] = sin(j*3.0);
}
clock_t start, finish;
start = clock();
for (int j = 0; j < i; j++) {
a[j] = b[j]+b[j]*c[j];
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for vector addition and multiplication=" << timeused << endl;
delete [] a;
delete [] b;
delete [] c;
return 0;
}
c++ -o novec.x vecexample.cpp
and with vectorization (and additional optimizations)
c++ -O3 -o vec.x vecexample.cpp
The speedup depends on the size of the vectors. In the example here we have run with \( 10^7 \) elements. The example here was run on a PC with ubuntu 14.04 as operating system and an Intel i7-4790 CPU running at 3.60 GHz.
Compphys:~ hjensen$ ./vec.x 10000000
Time used for vector addition = 0.0100000
Compphys:~ hjensen$ ./novec.x 10000000
Time used for vector addition = 0.03000000000
This particular C++ compiler speeds up the above loop operations with a factor of 3. Performing the same operations for \( 10^8 \) elements results only in a factor \( 1.4 \). The result will however vary from compiler to compiler. In general however, with optimization flags like \( -O3 \) or \( -Ofast \), we gain a considerable speedup if our code can be vectorized. Many of these operations can be done automatically by your compiler. These automatic or near automatic compiler techniques improve performance considerably.
Not all loops can be vectorized, as discussed in Intel's guide to vectorization
An important criteria is that the loop counter \( n \) is known at the entry of the loop.
for (int j = 0; j < n; j++) {
a[j] = cos(j*1.0);
}
The variable \( n \) does need to be known at compile time. However, this variable must stay the same for the entire duration of the loop. It implies that an exit statement inside the loop cannot be data dependent.
An exit statement should in general be avoided. If the exit statement contains data-dependent conditions, the loop cannot be vectorized. The following is an example of a non-vectorizable loop
for (int j = 0; j < n; j++) {
a[j] = cos(j*1.0);
if (a[j] < 0 ) break;
}
Avoid loop termination conditions and opt for a single entry loop variable \( n \). The lower and upper bounds have to be kept fixed within the loop.
for (int j = 0; j < n; j++) {
double x = cos(j*1.0);
if (x > 0 ) {
a[j] = x*sin(j*2.0);
}
else {
a[j] = 0.0;
}
}
These operations can be performed for all data elements but only those elements which the mask evaluates as true are stored. In general, one should avoid branches such as switch, go to, or return statements or if constructs that cannot be treated as masked assignments.
Only the innermost loop of the following example is vectorized
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
a[i][j] += b[i][j];
}
}
The exception is if an original outer loop is transformed into an inner loop as the result of compiler optimizations.
Calls to programmer defined functions ruin vectorization. However, calls to intrinsic functions like \( \sin{x} \), \( \cos{x} \), \( \exp{x} \) etc are allowed since they are normally efficiently vectorized. The following example is fully vectorizable
for (int i = 0; i < n; i++) {
a[i] = log10(i)*cos(i);
}
Similarly, inline functions defined by the programmer, allow for vectorization since the function statements are glued into the actual place where the function is called.
One has to keep in mind that vectorization changes the order of operations inside a loop. A so-called read-after-write statement with an explicit flow dependency cannot be vectorized. The following code
double b = 15.;
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + b;
}
is an example of flow dependency and results in wrong numerical results if vectorized. For a scalar operation, the value \( a[i-1] \) computed during the iteration is loaded into the right-hand side and the results are fine. In vector mode however, with a vector length of four, the values \( a[0] \), \( a[1] \), \( a[2] \) and \( a[3] \) from the previous loop will be loaded into the right-hand side and produce wrong results. That is, we have
a[1] = a[0] + b;
a[2] = a[1] + b;
a[3] = a[2] + b;
a[4] = a[3] + b;
and if the two first iterations are executed at the same by the SIMD instruction, the value of say \( a[1] \) could be used by the second iteration before it has been calculated by the first iteration, leading thereby to wrong results.
On the other hand, a so-called write-after-read statement can be vectorized. The following code
double b = 15.;
for (int i = 1; i < n; i++) {
a[i-1] = a[i] + b;
}
is an example of flow dependency that can be vectorized since no iteration with a higher value of \( i \) can complete before an iteration with a lower value of \( i \). However, such code leads to problems with parallelization.
For C++ programmers it is also worth keeping in mind that an array notation is preferred to the more compact use of pointers to access array elements. The compiler can often not tell if it is safe to vectorize the code.
When dealing with arrays, you should also avoid memory stride, since this slows down considerably vectorization. When you access array element, write for example the inner loop to vectorize using unit stride, that is, access successively the next array element in memory, as shown here
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
a[i][j] += b[i][j];
}
}
clang -o novec.x vecexample.cpp
and with vectorization (and additional optimizations)
clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
The speedup depends on the size of the vectors. In the example here we have run with \( 10^7 \) elements. The example here was run on an IMac17.1 with OSX El Capitan (10.11.4) as operating system and an Intel i5 3.3 GHz CPU.
Compphys:~ hjensen$ ./vec.x 10000000
Time used for norm computation=0.04720500000
Compphys:~ hjensen$ ./novec.x 10000000
Time used for norm computation=0.03311700000
This particular C++ compiler speeds up the above loop operations with a factor of 1.5 Performing the same operations for \( 10^9 \) elements results in a smaller speedup since reading from main memory is required. The non-vectorized code is seemingly faster.
Compphys:~ hjensen$ ./vec.x 1000000000
Time used for norm computation=58.41391100
Compphys:~ hjensen$ ./novec.x 1000000000
Time used for norm computation=46.51295300
We will discuss these issues further in the next slides.
clang++ -o -fno-vectorize novec.x vecexample.cpp
and with vectorization
clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
We can also add vectorization analysis, see for example
clang++ -O3 -Rpass-analysis=loop-vectorize -o vec.x vecexample.cpp
or figure out if vectorization was missed
clang++ -O3 -Rpass-missed=loop-vectorize -o vec.x vecexample.cpp
How do we measure erformance? What is wrong with this code to time a loop?
clock_t start, finish;
start = clock();
for (int j = 0; j < i; j++) {
a[j] = b[j]+b[j]*c[j];
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
What happens when the code is executed? The assumption is that the code is ready to execute. But
$$ \left( \begin{array}{ccccc} b_0 & c_0 & & & \\ a_0 & b_1 & c_1 & & \\ & & \ddots & & \\ & & a_{m-3} & b_{m-2} & c_{m-2} \\ & & & a_{m-2} & b_{m-1} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_{m-2} \\ x_{m-1} \end{array} \right)=\left( \begin{array}{c} f_0 \\ f_1 \\ \vdots \\ f_{m-2} \\ f_{m-1} \\ \end{array} \right) $$
The first step is to multiply the first row by \( a_0/b_0 \) and subtract it from the second row. This is known as the forward substitution step. We obtain then $$ a_i = 0, $$ $$ b_i = b_i - \frac{a_{i-1}}{b_{i-1}}c_{i-1}, $$ and $$ f_i = f_i - \frac{a_{i-1}}{b_{i-1}}f_{i-1}. $$ At this point the simplified equation, with only an upper triangular matrix takes the form $$ \left( \begin{array}{ccccc} b_0 & c_0 & & & \\ & b_1 & c_1 & & \\ & & \ddots & & \\ & & & b_{m-2} & c_{m-2} \\ & & & & b_{m-1} \end{array} \right)\left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_{m-2} \\ x_{m-1} \end{array} \right)=\left( \begin{array}{c} f_0 \\ f_1 \\ \vdots \\ f_{m-2} \\ f_{m-1} \\ \end{array} \right) $$
The next step is the backward substitution step. The last row is multiplied by \( c_{N-3}/b_{N-2} \) and subtracted from the second to last row, thus eliminating \( c_{N-3} \) from the last row. The general backward substitution procedure is $$ c_i = 0, $$ and $$ f_{i-1} = f_{i-1} - \frac{c_{i-1}}{b_i}f_i $$ All that remains to be computed is the solution, which is the very straight forward process of $$ x_i = \frac{f_i}{b_i} $$
Operation | Floating Point |
---|---|
Memory Reads | \( 14(N-2) \) |
Memory Writes | \( 4(N-2) \) |
Subtractions | \( 3(N-2) \) |
Multiplications | \( 3(N-2) \) |
Divisions | \( 4(N-2) \) |
// Forward substitution
// Note that we can simplify by precalculating a[i-1]/b[i-1]
for (int i=1; i < n; i++) {
b[i] = b[i] - (a[i-1]*c[i-1])/b[i-1];
f[i] = g[i] - (a[i-1]*f[i-1])/b[i-1];
}
x[n-1] = f[n-1] / b[n-1];
// Backwards substitution
for (int i = n-2; i >= 0; i--) {
f[i] = f[i] - c[i]*f[i+1]/b[i+1];
x[i] = f[i]/b[i];
}
Operation | Floating Point |
---|---|
Memory Reads | \( 6(N-2) \) |
Memory Writes | \( 2(N-2) \) |
Additions | \( 2(N-2) \) |
Divisions | \( 2(N-2) \) |
// Forward substitution cannot be vectorized
for (int i = 2; i < n; i++) b[i] = b[i] + b[i-1]/d[i-1];
// Backward substitution cannot be vectorized
solution[n-1] = b[n-1]/d[n-1];
for (int i = n-2; i > 0; i--) solution[i] = (b[i]+solution[i+1])/d[i];
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double **A, **B;
// Allocate space for the two matrices
A = new double*[n]; B = new double*[n];
for (int i = 0; i < n; i++){
A[i] = new double[n];
B[i] = new double[n];
}
// Set up values for matrix A
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
A[i][j] = cos(i*1.0)*sin(j*3.0);
}
}
clock_t start, finish;
start = clock();
// Then compute the transpose
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
B[i][j]= A[j][i];
}
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for setting up transpose of matrix=" << timeused << endl;
// Free up space
for (int i = 0; i < n; i++){
delete[] A[i];
delete[] B[i];
}
delete[] A;
delete[] B;
return 0;
}
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double s = 1.0/sqrt( (double) n);
double **A, **B, **C;
// Start timing
clock_t start, finish;
start = clock();
// Allocate space for the two matrices
A = new double*[n]; B = new double*[n]; C = new double*[n];
for (int i = 0; i < n; i++){
A[i] = new double[n];
B[i] = new double[n];
C[i] = new double[n];
}
// Set up values for matrix A and B and zero matrix C
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
double angle = 2.0*M_PI*i*j/ (( double ) n);
A[i][j] = s * ( sin ( angle ) + cos ( angle ) );
B[j][i] = A[i][j];
}
}
// Then perform the matrix-matrix multiplication
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
double sum = 0.0;
for (int k = 0; k < n; k++) {
sum += B[i][k]*A[k][j];
}
C[i][j] = sum;
}
}
// Compute now the Frobenius norm
double Fsum = 0.0;
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
Fsum += C[i][j]*C[i][j];
}
}
Fsum = sqrt(Fsum);
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for matrix-matrix multiplication=" << timeused << endl;
cout << " Frobenius norm = " << Fsum << endl;
// Free up space
for (int i = 0; i < n; i++){
delete[] A[i];
delete[] B[i];
delete[] C[i];
}
delete[] A;
delete[] B;
delete[] C;
return 0;
}
The key is choosing the correct baseline for comparison
For parallel applications, speedup is typically defined as
The speedup on \( p \) processors can be greater than \( p \) if memory usage is optimal! Consider the case of a memorybound computation with \( M \) words of memory
Assume that almost all parts of a code are perfectly parallelizable (fraction \( f \)). The remainder, fraction \( (1-f) \) cannot be parallelized at all.
That is, there is work that takes time \( W \) on one process; a fraction \( f \) of that work will take time \( Wf/p \) on \( p \) processors.
On one processor we have $$ T_1 = (1-f)W + fW = W $$ On \( p \) processors we have $$ T_p = (1-f)W + \frac{fW}{p}, $$ resulting in a speedup of $$ \frac{T_1}{T_p} = \frac{W}{(1-f)W+fW/p} $$
As p goes to infinity, \( fW/p \) goes to zero, and the maximum speedup is $$ \frac{1}{1-f}, $$ meaning that if if \( f = 0.99 \) (all but \( 1\% \) parallelizable), the maximum speedup is \( 1/(1-.99)=100 \)!
If any non-parallel code slips into the application, the parallel performance is limited.
In many simulations, however, the fraction of non-parallelizable work is \( 10^{-6} \) or less due to large arrays or objects that are perfectly parallelizable.
To install MPI is rather easy on hardware running unix/linux as operating systems, follow simply the instructions from the OpenMPI website. See also subsequent slides. When you have made sure you have installed MPI on your PC/laptop,
# Compile and link
mpic++ -O3 -o nameofprog.x nameofprog.cpp
# run code with for example 8 processes using mpirun/mpiexec
mpiexec -n 8 ./nameofprog.x
If you wish to install MPI and OpenMP on your laptop/PC, we recommend the following:
brew install libomp
and compile and link as
c++ -o <name executable> <name program.cpp> -lomp
For linux/ubuntu users, you need to install two packages (alternatively use the synaptic package manager)
sudo apt-get install libopenmpi-dev
sudo apt-get install openmpi-bin
For OS X users, install brew (after having installed xcode and gcc, needed for the gfortran compiler of openmpi) and then install with brew
brew install openmpi
When running an executable (code.x), run as
mpirun -n 10 ./code.x
where we indicate that we want the number of processes to be 10.
With openmpi installed, when using Qt, add to your .pro file the instructions here
You may need to tell Qt where openmpi is stored.
For the machines at the computer lab, openmpi is located at
/usr/lib64/openmpi/bin
Add to your .bashrc file the following
export PATH=/usr/lib64/openmpi/bin:$PATH
For running on SMAUG, go to http://comp-phys.net/ and click on the link internals and click on computing cluster. To get access to Smaug, you will need to send us an e-mail with your name, UiO username, phone number, room number and affiliation to the research group. In return, you will receive a password you may use to access the cluster.
Here follows a simple recipe
log in as ssh -username tid.uio.no
ssh username@fyslab-compphys
In the folder
shared/guides/starting_jobs
you will find a simple example on how to set up a job and compile and run. This files are write protected. Copy them to your own folder and compile and run there. For more information see the readme file under the program folder.
#include <omp.h>
#pragma omp...
#pragma omp construct [ clause ...]
#include <omp.h>
#include <omp.h>
main ()
{
int var1, var2, var3;
/* serial code */
/* ... */
/* start of a parallel region */
#pragma omp parallel private(var1, var2) shared(var3)
{
/* ... */
}
/* more serial code */
/* ... */
/* another parallel region */
#pragma omp parallel
{
/* ... */
}
}
#pragma omp parallel { ... }
#include <omp.h>
#include <cstdio>
int main (int argc, char *argv[])
{
int th_id, nthreads;
#pragma omp parallel private(th_id) shared(nthreads)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return 0;
}
#include <cstdio>
#include <omp.h>
int main(int argc, char *argv[])
{
omp_set_num_threads(4);
#pragma omp parallel
{
int id = omp_get_thread_num();
int nproc = omp_get_num_threads();
cout << "Hello world with id number and processes " << id << nproc << endl;
}
return 0;
}
Variables declared outside of the parallel region are shared by all threads If a variable like id is declared outside of the
#pragma omp parallel,
it would have been shared by various the threads, possibly causing erroneous output
Private clause can be used to make thread- private versions of such variables:
#pragma omp parallel private(id)
{
int id = omp_get_thread_num();
cout << "My thread num" << id << endl;
}
It is often useful to have only one thread execute some of the code in a parallel region. I/O statements are a common example
#pragma omp parallel
{
#pragma omp master
{
int id = omp_get_thread_num();
cout << "My thread num" << id << endl;
}
}
#pragma omp for
OpenMP provides an easy way to parallelize a loop
#pragma omp parallel for
for (i=0; i<n; i++) c[i] = a[i];
OpenMP handles index variable (no need to declare in for loop or make private)
Which thread does which values? Several options.
We can let the OpenMP runtime decide. The decision is about how the loop iterates are scheduled and OpenMP defines three choices of loop scheduling:
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
int main (int argc, char *argv[])
{
int i, chunk;
float a[N], b[N], c[N];
for (i=0; i < N; i++) a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk)
for (i=0; i < N; i++) c[i] = a[i] + b[i];
} /* end of parallel region */
}
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
int main (int argc, char *argv[])
{
int i, chunk;
float a[N], b[N], c[N];
for (i=0; i < N; i++) a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(guided,chunk)
for (i=0; i < N; i++) c[i] = a[i] + b[i];
} /* end of parallel region */
}
// #pragma omp parallel and #pragma omp for
can be combined into
#pragma omp parallel for
What happens with code like this
#pragma omp parallel for
for (i=0; i<n; i++) sum += a[i]*a[i];
All threads can access the sum variable, but the addition is not atomic! It is important to avoid race between threads. So-called reductions in OpenMP are thus important for performance and for obtaining correct results. OpenMP lets us indicate that a variable is used for a reduction with a particular operator. The above code becomes
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i<n; i++) sum += a[i]*a[i];
$$ \sum_{i=0}^{n-1} a_ib_i $$
int i;
double sum = 0.;
/* allocating and initializing arrays */
/* ... */
#pragma omp parallel for default(shared) private(i) reduction(+:sum)
for (i=0; i<N; i++) sum += a[i]*b[i];
}
Different threads do different tasks independently, each section is executed by one thread.
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
funcA ();
#pragma omp section
funcB ();
#pragma omp section
funcC ();
}
}
#pragma omp single { ... }
The code is executed by one thread only, no guarantee which thread
Can introduce an implicit barrier at the end
#pragma omp master { ... }
Code executed by the master thread, guaranteed and no implicit barrier at the end.
#pragma omp barrier
Synchronization, must be encountered by all threads in a team (or none)
#pragma omp ordered { a block of codes }
is another form of synchronization (in sequential order). The form
#pragma omp critical { a block of codes }
and
#pragma omp atomic { single assignment statement }
is more efficient than
#pragma omp critical
for (i=0; i<100; i++)
for (j=0; j<100; j++)
a[i][j] = b[i][j] + c[i][j];
}
}
#pragma omp parallel for private(j)
for (i=0; i<100; i++)
for (j=0; j<100; j++)
a[i][j] = b[i][j] + c[i][j];
}
}
When a thread in a parallel region encounters another parallel construct, it may create a new team of threads and become the master of the new team.
#pragma omp parallel num_threads(4)
{
/* .... */
#pragma omp parallel num_threads(2)
{
//
}
}
#pragma omp task
#pragma omp parallel shared(p_vec) private(i)
{
#pragma omp single
{
for (i=0; i<N; i++) {
double r = random_number();
if (p_vec[i] > r) {
#pragma omp task
do_work (p_vec[i]);
Race condition
int nthreads;
#pragma omp parallel shared(nthreads)
{
nthreads = omp_get_num_threads();
}
Deadlock
#pragma omp parallel
{
...
#pragma omp critical
{
...
#pragma omp barrier
}
}
Not all computations are simple loops where the data can be evenly divided among threads without any dependencies between threads
An example is finding the location and value of the largest element in an array
for (i=0; i<n; i++) {
if (x[i] > maxval) {
maxval = x[i];
maxloc = i;
}
}
All threads are potentially accessing and changing the same values, maxloc and maxval.
#pragma omp atomic
#pragma omp critical
Write down the simplest algorithm and look carefully for race conditions. How would you handle them? The first step would be to parallelize as
#pragma omp parallel for
for (i=0; i<n; i++) {
if (x[i] > maxval) {
maxval = x[i];
maxloc = i;
}
}
Write down the simplest algorithm and look carefully for race conditions. How would you handle them? The first step would be to parallelize as
#pragma omp parallel for
for (i=0; i<n; i++) {
#pragma omp critical
{
if (x[i] > maxval) {
maxval = x[i];
maxloc = i;
}
}
}
Exercise: write a code which implements this and give an estimate on performance. Perform several runs, with a serial code only with and without vectorization and compare the serial code with the one that uses OpenMP. Run on different archictectures if you can.
Performance poor because we insisted on keeping track of the maxval and location during the execution of the loop.
Idea: Have each thread find the maxloc in its own data, then combine and use temporary arrays indexed by thread number to hold the values found by each thread
int maxloc[MAX_THREADS], mloc;
double maxval[MAX_THREADS], mval;
#pragma omp parallel shared(maxval,maxloc)
{
int id = omp_get_thread_num();
maxval[id] = -1.0e30;
#pragma omp for
for (int i=0; i<n; i++) {
if (x[i] > maxval[id]) {
maxloc[id] = i;
maxval[id] = x[i];
}
}
}
#pragma omp flush (maxloc,maxval)
#pragma omp master
{
int nt = omp_get_num_threads();
mloc = maxloc[0];
mval = maxval[0];
for (int i=1; i<nt; i++) {
if (maxval[i] > mval) {
mval = maxval[i];
mloc = maxloc[i];
}
}
}
Note that we let the master process perform the last operation.
// OpenMP program to compute vector norm by adding two other vectors
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include <omp.h>
# include <ctime>
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of vector
int n = atoi(argv[1]);
double *a, *b, *c;
int i;
int thread_num;
double wtime, Norm2, s, angle;
cout << " Perform addition of two vectors and compute the norm-2." << endl;
omp_set_num_threads(4);
thread_num = omp_get_max_threads ();
cout << " The number of processors available = " << omp_get_num_procs () << endl ;
cout << " The number of threads available = " << thread_num << endl;
cout << " The matrix order n = " << n << endl;
s = 1.0/sqrt( (double) n);
wtime = omp_get_wtime ( );
// Allocate space for the vectors to be used
a = new double [n]; b = new double [n]; c = new double [n];
// Define parallel region
# pragma omp parallel for default(shared) private (angle, i) reduction(+:Norm2)
// Set up values for vectors a and b
for (i = 0; i < n; i++){
angle = 2.0*M_PI*i/ (( double ) n);
a[i] = s*(sin(angle) + cos(angle));
b[i] = s*sin(2.0*angle);
c[i] = 0.0;
}
// Then perform the vector addition
for (i = 0; i < n; i++){
c[i] += a[i]+b[i];
}
// Compute now the norm-2
Norm2 = 0.0;
for (i = 0; i < n; i++){
Norm2 += c[i]*c[i];
}
// end parallel region
wtime = omp_get_wtime ( ) - wtime;
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for norm-2 computation=" << wtime << endl;
cout << " Norm-2 = " << Norm2 << endl;
// Free up space
delete[] a;
delete[] b;
delete[] c;
return 0;
}
// Matrix-matrix multiplication and Frobenius norm of a matrix with OpenMP
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include <omp.h>
# include <ctime>
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double **A, **B, **C;
int i, j, k;
int thread_num;
double wtime, Fsum, s, angle;
cout << " Compute matrix product C = A * B and Frobenius norm." << endl;
omp_set_num_threads(4);
thread_num = omp_get_max_threads ();
cout << " The number of processors available = " << omp_get_num_procs () << endl ;
cout << " The number of threads available = " << thread_num << endl;
cout << " The matrix order n = " << n << endl;
s = 1.0/sqrt( (double) n);
wtime = omp_get_wtime ( );
// Allocate space for the two matrices
A = new double*[n]; B = new double*[n]; C = new double*[n];
for (i = 0; i < n; i++){
A[i] = new double[n];
B[i] = new double[n];
C[i] = new double[n];
}
// Define parallel region
# pragma omp parallel for default(shared) private (angle, i, j, k) reduction(+:Fsum)
// Set up values for matrix A and B and zero matrix C
for (i = 0; i < n; i++){
for (j = 0; j < n; j++) {
angle = 2.0*M_PI*i*j/ (( double ) n);
A[i][j] = s * ( sin ( angle ) + cos ( angle ) );
B[j][i] = A[i][j];
}
}
// Then perform the matrix-matrix multiplication
for (i = 0; i < n; i++){
for (j = 0; j < n; j++) {
C[i][j] = 0.0;
for (k = 0; k < n; k++) {
C[i][j] += A[i][k]*B[k][j];
}
}
}
// Compute now the Frobenius norm
Fsum = 0.0;
for (i = 0; i < n; i++){
for (j = 0; j < n; j++) {
Fsum += C[i][j]*C[i][j];
}
}
Fsum = sqrt(Fsum);
// end parallel region and letting only one thread perform I/O
wtime = omp_get_wtime ( ) - wtime;
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for matrix-matrix multiplication=" << wtime << endl;
cout << " Frobenius norm = " << Fsum << endl;
// Free up space
for (int i = 0; i < n; i++){
delete[] A[i];
delete[] B[i];
delete[] C[i];
}
delete[] A;
delete[] B;
delete[] C;
return 0;
}
MPI is a library, not a language. It specifies the names, calling sequences and results of functions or subroutines to be called from C/C++ or Fortran programs, and the classes and methods that make up the MPI C++ library. The programs that users write in Fortran, C or C++ are compiled with ordinary compilers and linked with the MPI library.
MPI programs should be able to run on all possible machines and run all MPI implementetations without change.
An MPI computation is a collection of processes communicating with messages.
Task parallelism: the work of a global problem can be divided into a number of independent tasks, which rarely need to synchronize. Monte Carlo simulations or numerical integration are examples of this.
MPI is a message-passing library where all the routines have corresponding C/C++-binding
MPI_Command_name
and Fortran-binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME
MPI is a library specification for the message passing interface, proposed as a standard.
Insert communication and synchronization functions where necessary.
MPI is a message-passing library where all the routines have corresponding C/C++-binding
MPI_Command_name
and Fortran-binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME
The discussion in these slides focuses on the C++ binding.
MPI_COMM_WORLD
Let every process write "Hello world" (oh not this program again!!) on the standard output.
using namespace std;
#include <mpi.h>
#include <iostream>
int main (int nargs, char* args[])
{
int numprocs, my_rank;
// MPI initializations
MPI_Init (&nargs, &args);
MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
cout << "Hello world, I have rank " << my_rank << " out of "
<< numprocs << endl;
// End MPI
MPI_Finalize ();
PROGRAM hello
INCLUDE "mpif.h"
INTEGER:: size, my_rank, ierr
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
WRITE(*,*)"Hello world, I've rank ",my_rank," out of ",size
CALL MPI_FINALIZE(ierr)
END PROGRAM hello
int main (int nargs, char* args[])
{
int numprocs, my_rank, i;
MPI_Init (&nargs, &args);
MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
for (i = 0; i < numprocs; i++) {}
MPI_Barrier (MPI_COMM_WORLD);
if (i == my_rank) {
cout << "Hello world, I have rank " << my_rank <<
" out of " << numprocs << endl;}
MPI_Finalize ();
.....
int numprocs, my_rank, flag;
MPI_Status status;
MPI_Init (&nargs, &args);
MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
if (my_rank > 0)
MPI_Recv (&flag, 1, MPI_INT, my_rank-1, 100,
MPI_COMM_WORLD, &status);
cout << "Hello world, I have rank " << my_rank << " out of "
<< numprocs << endl;
if (my_rank < numprocs-1)
MPI_Send (&my_rank, 1, MPI_INT, my_rank+1,
100, MPI_COMM_WORLD);
MPI_Finalize ();
The basic sending of messages is given by the function \( MPI\_SEND \), which in C/C++ is defined as
int MPI_Send(void *buf, int count,
MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)}
This single command allows the passing of any kind of variable, even a large array, to any group of tasks. The variable buf is the variable we wish to send while count is the number of variables we are passing. If we are passing only a single value, this should be 1.
If we transfer an array, it is the overall size of the array. For example, if we want to send a 10 by 10 array, count would be \( 10\times 10=100 \) since we are actually passing 100 values.
Once you have sent a message, you must receive it on another task. The function \( MPI\_RECV \) is similar to the send call.
int MPI_Recv( void *buf, int count, MPI_Datatype datatype,
int source,
int tag, MPI_Comm comm, MPI_Status *status )
The arguments that are different from those in MPI\_SEND are buf which is the name of the variable where you will be storing the received data, source which replaces the destination in the send command. This is the return ID of the sender.
Finally, we have used \( MPI\_Status\_status \), where one can check if the receive was completed.
The output of this code is the same as the previous example, but now process 0 sends a message to process 1, which forwards it further to process 2, and so forth.
// Trapezoidal rule and numerical integration usign MPI
using namespace std;
#include <mpi.h>
#include <iostream>
// Here we define various functions called by the main program
double int_function(double );
double trapezoidal_rule(double , double , int , double (*)(double));
// Main function begins here
int main (int nargs, char* args[])
{
int n, local_n, numprocs, my_rank;
double a, b, h, local_a, local_b, total_sum, local_sum;
double time_start, time_end, total_time;
// MPI initializations
MPI_Init (&nargs, &args);
MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
time_start = MPI_Wtime();
// Fixed values for a, b and n
a = 0.0 ; b = 1.0; n = 1000;
h = (b-a)/n; // h is the same for all processes
local_n = n/numprocs;
// make sure n > numprocs, else integer division gives zero
// Length of each process' interval of
// integration = local_n*h.
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
total_sum = 0.0;
local_sum = trapezoidal_rule(local_a, local_b, local_n,
&int_function);
MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE,
MPI_SUM, 0, MPI_COMM_WORLD);
time_end = MPI_Wtime();
total_time = time_end-time_start;
if ( my_rank == 0) {
cout << "Trapezoidal rule = " << total_sum << endl;
cout << "Time = " << total_time
<< " on number of processors: " << numprocs << endl;
}
// End MPI
MPI_Finalize ();
return 0;
} // end of main program
Here we have used
MPI_reduce( void *senddata, void* resultdata, int count,
MPI_Datatype datatype, MPI_Op, int root, MPI_Comm comm)
The two variables \( senddata \) and \( resultdata \) are obvious, besides the fact that one sends the address of the variable or the first element of an array. If they are arrays they need to have the same size. The variable \( count \) represents the total dimensionality, 1 in case of just one variable, while \( MPI\_Datatype \) defines the type of variable which is sent and received.
The new feature is \( MPI\_Op \). It defines the type of operation we want to do.
In our case, since we are summing the rectangle contributions from every process we define \( MPI\_Op = MPI\_SUM \). If we have an array or matrix we can search for the largest og smallest element by sending either \( MPI\_MAX \) or \( MPI\_MIN \). If we want the location as well (which array element) we simply transfer \( MPI\_MAXLOC \) or \( MPI\_MINOC \). If we want the product we write \( MPI\_PROD \).
\( MPI\_Allreduce \) is defined as
MPI_Allreduce( void *senddata, void* resultdata, int count,
MPI_Datatype datatype, MPI_Op, MPI_Comm comm)
We use \( MPI\_reduce \) to collect data from each process. Note also the use of the function \( MPI\_Wtime \).
// this function defines the function to integrate
double int_function(double x)
{
double value = 4./(1.+x*x);
return value;
} // end of function to evaluate
// this function defines the trapezoidal rule
double trapezoidal_rule(double a, double b, int n,
double (*func)(double))
{
double trapez_sum;
double fa, fb, x, step;
int j;
step=(b-a)/((double) n);
fa=(*func)(a)/2. ;
fb=(*func)(b)/2. ;
trapez_sum=0.;
for (j=1; j <= n-1; j++){
x=j*step+a;
trapez_sum+=(*func)(x);
}
trapez_sum=(trapez_sum+fb+fa)*step;
return trapez_sum;
} // end trapezoidal_rule