Computational Physics Lectures: How to optimize codes, from vectorization to parallelization

Compiling with and without vectorization

clang++ -o novec.x vecexample.cpp

and with vectorization (and additional optimizations)

clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp

The speedup depends on the size of the vectors. In the example here we have run with $10^7$ elements. The example was run on an iMac17,1 with OS X El Capitan (10.11.4) as operating system and an Intel i5 3.3 GHz CPU.

Compphys:~ hjensen$ ./vec.x 10000000
Time used  for norm computation=0.04720500000
Compphys:~ hjensen$ ./novec.x 10000000
Time used  for norm computation=0.03311700000

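With this particular C++ compiler the two timings for the above loop operations differ by a factor of about 1.4. Note that in the output as printed it is novec.x that reports the shorter time. Performing the same operations for $10^9$ elements gives a smaller ratio (about 1.26), since the data must then be read from main memory. Here too the non-vectorized code is seemingly faster.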

Compphys:~ hjensen$ ./vec.x 1000000000
Time used  for norm computation=58.41391100
Compphys:~ hjensen$ ./novec.x 1000000000
Time used  for norm computation=46.51295300

We will discuss these issues further in the next slides.