Optimization and Vectorization
Contents
8. Optimization and Vectorization¶
8.1. Optimization and profiling¶
Till now we have not paid much attention to speed and possible optimization possibilities inherent in the various compilers. We have compiled and linked as
c++ -c mycode.cpp
c++ -o mycode.exe mycode.o
For Fortran replace with for example gfortran or ifort. This is what we call a flat compiler option and should be used when we develop the code. It produces normally a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output because every operation is done precisely as the user specified it.
It is instructive to look up the compiler manual for further instructions by writing
man c++
8.2. More on optimization¶
We have additional compiler options for optimization. These may include procedure inlining where performance may be improved, moving constants inside loops outside the loop, identify potential parallelism, include automatic vectorization or replace a division with a reciprocal and a multiplication if this speeds up the code.
c++ -O3 -c mycode.cpp
c++ -O3 -o mycode.exe mycode.o
This (other options are -O2 or -Ofast) is the recommended option.
8.3. Optimization and profiling¶
It is also useful to profile your program under the development stage. You would then compile with
c++ -pg -O3 -c mycode.cpp
c++ -pg -O3 -o mycode.exe mycode.o
After you have run the code you can obtain the profiling information via
gprof mycode.exe > ProfileOutput
When you have profiled properly your code, you must take out this option as it slows down performance. For memory tests use valgrind. An excellent environment for all these aspects, and much more, is Qt creator.
8.4. Optimization and debugging¶
Adding debugging options is a very useful alternative under the development stage of a program. You would then compile with
c++ -g -O0 -c mycode.cpp
c++ -g -O0 -o mycode.exe mycode.o
This option generates debugging information allowing you to trace for example if an array is properly allocated. Some compilers work best with the no optimization option -O0.
Other optimization flags.
Depending on the compiler, one can add flags which generate code that catches integer overflow errors. The flag -ftrapv does this for the CLANG compiler on OS X operating systems.
8.5. Other hints¶
In general, irrespective of compiler options, it is useful to
avoid if tests or call to functions inside loops, if possible.
avoid multiplication with constants inside loops if possible
Here is an example of a part of a program where specific operations lead to a slower code
k = n-1;
for (i = 0; i < n; i++){
a[i] = b[i] +c*d;
e = g[k];
}
A better code is
temp = c*d;
for (i = 0; i < n; i++){
a[i] = b[i] + temp;
}
e = g[n-1];
Here we avoid a repeated multiplication inside a loop. Most compilers, depending on compiler flags, identify and optimize such bottlenecks on their own, without requiring any particular action by the programmer. However, it is always useful to single out and avoid code examples like the first one discussed here.
8.6. Vectorization and the basic idea behind parallel computing¶
Present CPUs are highly parallel processors with varying levels of parallelism. The typical situation can be described via the following three statements.
Pursuit of shorter computation time and larger simulation size gives rise to parallel computing.
Multiple processors are involved to solve a global problem.
The essence is to divide the entire computation evenly among collaborative processors. Divide and conquer.
Before we proceed with a more detailed discussion of topics like vectorization and parallelization, we need to remind ourselves about some basic features of different hardware models.
8.7. A rough classification of hardware models¶
Conventional single-processor computers are named SISD (single-instruction-single-data) machines.
SIMD (single-instruction-multiple-data) machines incorporate the idea of parallel processing, using a large number of processing units to execute the same instruction on different data.
Modern parallel computers are so-called MIMD (multiple-instruction-multiple-data) machines and can execute different instruction streams in parallel on different data.
8.9. Different parallel programming paradigms¶
Task parallelism: the work of a global problem can be divided into a number of independent tasks, which rarely need to synchronize. Monte Carlo simulations represent a typical situation. Integration is another. However this paradigm is of limited use.
Data parallelism: use of multiple threads (e.g. one or more threads per processor) to dissect loops over arrays etc. Communication and synchronization between processors are often hidden, thus easy to program. However, the user surrenders much control to a specialized compiler. Examples of data parallelism are compiler-based parallelization and OpenMP directives.
8.10. Different parallel programming paradigms¶
Message passing: all involved processors have an independent memory address space. The user is responsible for partitioning the data/work of a global problem and distributing the subproblems to the processors. Collaboration between processors is achieved by explicit message passing, which is used for data transfer plus synchronization.
This paradigm is the most general one where the user has full control. Better parallel efficiency is usually achieved by explicit message passing. However, message-passing programming is more difficult.
8.11. What is vectorization?¶
Vectorization is a special case of Single Instructions Multiple Data (SIMD) to denote a single instruction stream capable of operating on multiple data elements in parallel. We can think of vectorization as the unrolling of loops accompanied with SIMD instructions.
Vectorization is the process of converting an algorithm that performs scalar operations (typically one operation at the time) to vector operations where a single operation can refer to many simultaneous operations. Consider the following example
for (i = 0; i < n; i++){
a[i] = b[i] + c[i];
}
If the code is not vectorized, the compiler will simply start with the first element and then perform subsequent additions operating on one address in memory at the time.
8.12. Number of elements that can acted upon¶
A SIMD instruction can operate on multiple data elements in one single instruction.
It uses the so-called 128-bit SIMD floating-point register.
In this sense, vectorization adds some form of parallelism since one instruction is applied
to many parts of say a vector.
The number of elements which can be operated on in parallel range from four single-precision floating point data elements in so-called Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector-length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
IN summary, our instructions operate on 128 bit (16 byte) operands
4 floats or ints
2 doubles
Data paths 128 bits vide for vector unit
8.13. Number of elements that can acted upon, examples¶
We start with the simple scalar operations given by
for (i = 0; i < n; i++){
a[i] = b[i] + c[i];
}
If the code is not vectorized and we have a 128-bit register to store a 32 bits floating point number, it means that we have \(3\times 32\) bits that are not used. For the first element we have
0 | 1 | 2 | 3 |
---|---|---|---|
a[0]= | not used | not used | not used |
b[0]+ | not used | not used | not used |
c[0] | not used | not used | not used |
8.14. Operation counts for scalar operation¶
The code
for (i = 0; i < n; i++){
a[i] = b[i] + c[i];
}
has for \(n\) repeats
one load for \(c[i]\) in address 1
one load for \(b[i]\) in address 2
add \(c[i]\) and \(b[i]\) to give \(a[i]\)
store \(a[i]\) in address 2
8.15. Number of elements that can acted upon, examples¶
If we vectorize the code, we can perform, with a 128-bit register four simultaneous operations, that is we have
for (i = 0; i < n; i+=4){
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
a[i+3] = b[i+3] + c[i+3];
}
displayed here as
0 | 1 | 2 | 3 |
---|---|---|---|
a[0]= | a[1]= | a[2]= | a[3]= |
b[0]+ | b[1]+ | b[2]+ | b[3]+ |
c[0] | c[1] | c[2] | c[3] |
8.16. Number of operations when vectorized¶
For \(n/4\) repeats assuming floats or integers
one vector load for \(c[i]\) in address 1
one load for \(b[i]\) in address 2
add \(c[i]\) and \(b[i]\) to give \(a[i]\)
store \(a[i]\) in address 2
8.17. A simple test case with and without vectorization¶
We implement these operations in a simple c++ program that computes at the end the norm of a vector.
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double s = 1.0/sqrt( (double) n);
double *a, *b, *c;
// Start timing
clock_t start, finish;
start = clock();
// Allocate space for the vectors to be used
a = new double [n]; b = new double [n]; c = new double [n];
// Define parallel region
// Set up values for vectors a and b
for (int i = 0; i < n; i++){
double angle = 2.0*M_PI*i/ (( double ) n);
a[i] = s*(sin(angle) + cos(angle));
b[i] = s*sin(2.0*angle);
c[i] = 0.0;
}
// Then perform the vector addition
for (int i = 0; i < n; i++){
c[i] += a[i]+b[i];
}
// Compute now the norm-2
double Norm2 = 0.0;
for (int i = 0; i < n; i++){
Norm2 += c[i]*c[i];
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for norm computation=" << timeused << endl;
cout << " Norm-2 = " << Norm2 << endl;
// Free up space
delete[] a;
delete[] b;
delete[] c;
return 0;
}
8.18. Compiling with and without vectorization¶
We can compile and link without vectorization using the clang c++ compiler
clang -o novec.x vecexample.cpp
and with vectorization (and additional optimizations)
clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
The speedup depends on the size of the vectors. In the example here we have run with \(10^7\) elements. The example here was run on an IMac17.1 with OSX El Capitan (10.11.4) as operating system and an Intel i5 3.3 GHz CPU.
Compphys:~ hjensen$ ./vec.x 10000000
Time used for norm computation=0.04720500000
Compphys:~ hjensen$ ./novec.x 10000000
Time used for norm computation=0.03311700000
This particular C++ compiler speeds up the above loop operations with a factor of 1.5 Performing the same operations for \(10^9\) elements results in a smaller speedup since reading from main memory is required. The non-vectorized code is seemingly faster.
Compphys:~ hjensen$ ./vec.x 1000000000
Time used for norm computation=58.41391100
Compphys:~ hjensen$ ./novec.x 1000000000
Time used for norm computation=46.51295300
We will discuss these issues further in the next slides.
8.19. Compiling with and without vectorization using clang¶
We can compile and link without vectorization with clang compiler
clang++ -o -fno-vectorize novec.x vecexample.cpp
and with vectorization
clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
We can also add vectorization analysis, see for example
clang++ -O3 -Rpass-analysis=loop-vectorize -o vec.x vecexample.cpp
or figure out if vectorization was missed
clang++ -O3 -Rpass-missed=loop-vectorize -o vec.x vecexample.cpp
8.20. Automatic vectorization and vectorization inhibitors, criteria¶
Not all loops can be vectorized, as discussed in Intel’s guide to vectorization
An important criteria is that the loop counter \(n\) is known at the entry of the loop.
for (int j = 0; j < n; j++) {
a[j] = cos(j*1.0);
}
The variable \(n\) does need to be known at compile time. However, this variable must stay the same for the entire duration of the loop. It implies that an exit statement inside the loop cannot be data dependent.
8.21. Automatic vectorization and vectorization inhibitors, exit criteria¶
An exit statement should in general be avoided. If the exit statement contains data-dependent conditions, the loop cannot be vectorized. The following is an example of a non-vectorizable loop
for (int j = 0; j < n; j++) {
a[j] = cos(j*1.0);
if (a[j] < 0 ) break;
}
Avoid loop termination conditions and opt for a single entry loop variable \(n\). The lower and upper bounds have to be kept fixed within the loop.
8.22. Automatic vectorization and vectorization inhibitors, straight-line code¶
SIMD instructions perform the same type of operations multiple times. A switch statement leads thus to a non-vectorizable loop since different statemens cannot branch. The following code can however be vectorized since the if statement is implemented as a masked assignment.
for (int j = 0; j < n; j++) {
double x = cos(j*1.0);
if (x > 0 ) {
a[j] = x*sin(j*2.0);
}
else {
a[j] = 0.0;
}
}
These operations can be performed for all data elements but only those elements which the mask evaluates as true are stored. In general, one should avoid branches such as switch, go to, or return statements or if constructs that cannot be treated as masked assignments.
8.23. Automatic vectorization and vectorization inhibitors, nested loops¶
Only the innermost loop of the following example is vectorized
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
a[i][j] += b[i][j];
}
}
The exception is if an original outer loop is transformed into an inner loop as the result of compiler optimizations.
8.24. Automatic vectorization and vectorization inhibitors, function calls¶
Calls to programmer defined functions ruin vectorization. However, calls to intrinsic functions like \(\sin{x}\), \(\cos{x}\), \(\exp{x}\) etc are allowed since they are normally efficiently vectorized. The following example is fully vectorizable
for (int i = 0; i < n; i++) {
a[i] = log10(i)*cos(i);
}
Similarly, inline functions defined by the programmer, allow for vectorization since the function statements are glued into the actual place where the function is called.
8.25. Automatic vectorization and vectorization inhibitors, data dependencies¶
One has to keep in mind that vectorization changes the order of operations inside a loop. A so-called read-after-write statement with an explicit flow dependency cannot be vectorized. The following code
double b = 15.;
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + b;
}
is an example of flow dependency and results in wrong numerical results if vectorized. For a scalar operation, the value \(a[i-1]\) computed during the iteration is loaded into the right-hand side and the results are fine. In vector mode however, with a vector length of four, the values \(a[0]\), \(a[1]\), \(a[2]\) and \(a[3]\) from the previous loop will be loaded into the right-hand side and produce wrong results. That is, we have
a[1] = a[0] + b;
a[2] = a[1] + b;
a[3] = a[2] + b;
a[4] = a[3] + b;
and if the two first iterations are executed at the same by the SIMD instruction, the value of say \(a[1]\) could be used by the second iteration before it has been calculated by the first iteration, leading thereby to wrong results.
8.26. Automatic vectorization and vectorization inhibitors, more data dependencies¶
On the other hand, a so-called write-after-read statement can be vectorized. The following code
double b = 15.;
for (int i = 1; i < n; i++) {
a[i-1] = a[i] + b;
}
is an example of flow dependency that can be vectorized since no iteration with a higher value of \(i\) can complete before an iteration with a lower value of \(i\). However, such code leads to problems with parallelization.
8.27. Automatic vectorization and vectorization inhibitors, memory stride¶
For C++ programmers it is also worth keeping in mind that an array notation is preferred to the more compact use of pointers to access array elements. The compiler can often not tell if it is safe to vectorize the code.
When dealing with arrays, you should also avoid memory stride, since this slows down considerably vectorization. When you access array element, write for example the inner loop to vectorize using unit stride, that is, access successively the next array element in memory, as shown here
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
a[i][j] += b[i][j];
}
}
8.28. Memory management¶
The main memory contains the program data
Cache memory contains a copy of the main memory data
Cache is faster but consumes more space and power. It is normally assumed to be much faster than main memory
Registers contain working data only
Modern CPUs perform most or all operations only on data in register
Multiple Cache memories contain a copy of the main memory data
Cache items accessed by their address in main memory
L1 cache is the fastest but has the least capacity
L2, L3 provide intermediate performance/size tradeoffs
Loads and stores to memory can be as important as floating point operations when we measure performance.
8.29. Memory and communication¶
Most communication in a computer is carried out in chunks, blocks of bytes of data that move together
In the memory hierarchy, data moves between memory and cache, and between different levels of cache, in groups called lines
Lines are typically 64-128 bytes, or 8-16 double precision words
Even if you do not use the data, it is moved and occupies space in the cache
Many of these performance features are not captured in most programming languages.
8.30. Measuring performance¶
How do we measure performance? What is wrong with this code to time a loop?
clock_t start, finish;
start = clock();
for (int j = 0; j < i; j++) {
a[j] = b[j]+b[j]*c[j];
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
8.31. Problems with measuring time¶
Timers are not infinitely accurate
All clocks have a granularity, the minimum time that they can measure
The error in a time measurement, even if everything is perfect, may be the size of this granularity (sometimes called a clock tick)
Always know what your clock granularity is
Ensure that your measurement is for a long enough duration (say 100 times the tick)
8.32. Problems with cold start¶
What happens when the code is executed? The assumption is that the code is ready to execute. But
Code may still be on disk, and not even read into memory.
Data may be in slow memory rather than fast (which may be wrong or right for what you are measuring)
Multiple tests often necessary to ensure that cold start effects are not present
Special effort often required to ensure data in the intended part of the memory hierarchy.
8.33. Problems with smart compilers¶
If the result of the computation is not used, the compiler may eliminate the code
Performance will look impossibly fantastic
Even worse, eliminate some of the code so the performance looks plausible
Ensure that the results are (or may be) used.
8.34. Problems with interference¶
Other activities are sharing your processor
Operating system, system demons, other users
Some parts of the hardware do not always perform with exactly the same performance
Make multiple tests and report
Easy choices include
Average tests represent what users might observe over time
8.35. Problems with measuring performance¶
Accurate, reproducible performance measurement is hard
Think carefully about your experiment:
What is it, precisely, that you want to measure?
How representative is your test to the situation that you are trying to measure?
8.36. Thomas algorithm for tridiagonal linear algebra equations¶
8.37. Thomas algorithm, forward substitution¶
The first step is to multiply the first row by \(a_0/b_0\) and subtract it from the second row. This is known as the forward substitution step. We obtain then
and
At this point the simplified equation, with only an upper triangular matrix takes the form
8.38. Thomas algorithm, backward substitution¶
The next step is the backward substitution step. The last row is multiplied by \(c_{N-3}/b_{N-2}\) and subtracted from the second to last row, thus eliminating \(c_{N-3}\) from the last row. The general backward substitution procedure is
and
All that ramains to be computed is the solution, which is the very straight forward process of
8.39. Thomas algorithm and counting of operations (floating point and memory)¶
Operation | Floating Point |
---|---|
Memory Reads | $14(N-2)$ |
Memory Writes | $4(N-2)$ |
Subtractions | $3(N-2)$ |
Multiplications | $3(N-2)$ |
Divisions | $4(N-2)$ |
// Forward substitution
// Note that we can simplify by precalculating a[i-1]/b[i-1]
for (int i=1; i < n; i++) {
b[i] = b[i] - (a[i-1]*c[i-1])/b[i-1];
f[i] = g[i] - (a[i-1]*f[i-1])/b[i-1];
}
x[n-1] = f[n-1] / b[n-1];
// Backwards substitution
for (int i = n-2; i >= 0; i--) {
f[i] = f[i] - c[i]*f[i+1]/b[i+1];
x[i] = f[i]/b[i];
}
8.40. Example: Transpose of a matrix¶
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double **A, **B;
// Allocate space for the two matrices
A = new double*[n]; B = new double*[n];
for (int i = 0; i < n; i++){
A[i] = new double[n];
B[i] = new double[n];
}
// Set up values for matrix A
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
A[i][j] = cos(i*1.0)*sin(j*3.0);
}
}
clock_t start, finish;
start = clock();
// Then compute the transpose
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
B[i][j]= A[j][i];
}
}
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for setting up transpose of matrix=" << timeused << endl;
// Free up space
for (int i = 0; i < n; i++){
delete[] A[i];
delete[] B[i];
}
delete[] A;
delete[] B;
return 0;
}
8.41. Matrix-matrix multiplication¶
This the matrix-matrix multiplication code with plain c++ memory allocation. It computes at the end the Frobenius norm.
#include <cstdlib>
#include <iostream>
#include <cmath>
#include <iomanip>
#include "time.h"
using namespace std; // note use of namespace
int main (int argc, char* argv[])
{
// read in dimension of square matrix
int n = atoi(argv[1]);
double s = 1.0/sqrt( (double) n);
double **A, **B, **C;
// Start timing
clock_t start, finish;
start = clock();
// Allocate space for the two matrices
A = new double*[n]; B = new double*[n]; C = new double*[n];
for (int i = 0; i < n; i++){
A[i] = new double[n];
B[i] = new double[n];
C[i] = new double[n];
}
// Set up values for matrix A and B and zero matrix C
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
double angle = 2.0*M_PI*i*j/ (( double ) n);
A[i][j] = s * ( sin ( angle ) + cos ( angle ) );
B[j][i] = A[i][j];
}
}
// Then perform the matrix-matrix multiplication
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
double sum = 0.0;
for (int k = 0; k < n; k++) {
sum += B[i][k]*A[k][j];
}
C[i][j] = sum;
}
}
// Compute now the Frobenius norm
double Fsum = 0.0;
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++) {
Fsum += C[i][j]*C[i][j];
}
}
Fsum = sqrt(Fsum);
finish = clock();
double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
cout << setiosflags(ios::showpoint | ios::uppercase);
cout << setprecision(10) << setw(20) << "Time used for matrix-matrix multiplication=" << timeused << endl;
cout << " Frobenius norm = " << Fsum << endl;
// Free up space
for (int i = 0; i < n; i++){
delete[] A[i];
delete[] B[i];
delete[] C[i];
}
delete[] A;
delete[] B;
delete[] C;
return 0;
}
8.42. How do we define speedup? Simplest form¶
Speedup measures the ratio of performance between two objects
Versions of same code, with different number of processors
Serial and vector versions
Try different programing languages, c++ and Fortran
Two algorithms computing the same result
8.43. How do we define speedup? Correct baseline¶
The key is choosing the correct baseline for comparison
For our serial vs. vectorization examples, using compiler-provided vectorization, the baseline is simple; the same code, with vectorization turned off
For parallel applications, this is much harder:
Choice of algorithm, decomposition, performance of baseline case etc.
8.44. Parallel speedup¶
For parallel applications, speedup is typically defined as
Speedup \(=T_1/T_p\)
Here \(T_1\) is the time on one processor and \(T_p\) is the time using \(p\) processors.
Can the speedup become larger than \(p\)? That means using \(p\) processors is more than \(p\) times faster than using one processor.
8.45. Speedup and memory¶
The speedup on \(p\) processors can be greater than \(p\) if memory usage is optimal! Consider the case of a memorybound computation with \(M\) words of memory
If \(M/p\) fits into cache while \(M\) does not, the time to access memory will be different in the two cases:
\(T_1\) uses the main memory bandwidth
\(T_p\) uses the appropriate cache bandwidth
8.46. Upper bounds on speedup¶
Assume that almost all parts of a code are perfectly parallelizable (fraction \(f\)). The remainder, fraction \((1-f)\) cannot be parallelized at all.
That is, there is work that takes time \(W\) on one process; a fraction \(f\) of that work will take time \(Wf/p\) on \(p\) processors.
What is the maximum possible speedup as a function of \(f\)?
8.47. Amdahl’s law¶
On one processor we have
On \(p\) processors we have
resulting in a speedup of
As \(p\) goes to infinity, \(fW/p\) goes to zero, and the maximum speedup is
meaning that if if \(f = 0.99\) (all but \(1\%\) parallelizable), the maximum speedup is \(1/(1-.99)=100\)!