Computational Physics Lectures: How to optimize codes, from vectorization to parallelization
Contents
Optimization and profiling
More on optimization
Optimization and profiling
Optimization and debugging
Other hints
Vectorization and the basic idea behind parallel computing
A rough classification of hardware models
Shared memory and distributed memory
Different parallel programming paradigms
Different parallel programming paradigms
What is vectorization?
Number of elements that can be acted upon
Number of elements that can be acted upon, examples
Number of elements that can be acted upon, examples
"A simple test case with and without vectorization":"https://github.com/CompPhysics/ComputationalPhysicsMSU/blob/master/doc/Programs/LecturePrograms/programs/Classes/cpp/program7.cpp"
Compiling with and without vectorization
Automatic vectorization and vectorization inhibitors, criteria
Automatic vectorization and vectorization inhibitors, exit criteria
Automatic vectorization and vectorization inhibitors, straight-line code
Automatic vectorization and vectorization inhibitors, nested loops
Automatic vectorization and vectorization inhibitors, function calls
Automatic vectorization and vectorization inhibitors, data dependencies
Automatic vectorization and vectorization inhibitors, more data dependencies
Automatic vectorization and vectorization inhibitors, memory stride
Compiling with and without vectorization
Compiling with and without vectorization using clang
Memory management
Memory and communication
Measuring performance
Problems with measuring time
Problems with cold start
Problems with smart compilers
Problems with interference
Problems with measuring performance
Thomas algorithm for tridiagonal linear algebra equations
Thomas algorithm, forward substitution
Thomas algorithm, backward substitution
Thomas algorithm and counting of operations (floating point and memory)
"The specialized Thomas algorithm (Project 1)":"https://github.com/CompPhysics/ComputationalPhysics/blob/master/doc/Projects/2016/Project1/Examples/TridiagonalTiming.cpp"
"Example: Transpose of a matrix":"https://github.com/CompPhysics/ComputationalPhysicsMSU/blob/master/doc/Programs/LecturePrograms/programs/Classes/cpp/program8.cpp"
"Matrix-matrix multiplication":"https://github.com/CompPhysics/ComputationalPhysicsMSU/blob/master/doc/Programs/LecturePrograms/programs/Classes/cpp/program9.cpp"
How do we define speedup? Simplest form
How do we define speedup? Correct baseline
Parallel speedup
Speedup and memory
Upper bounds on speedup
Amdahl's law
How much is parallelizable
Today's situation of parallel computing
Overhead present in parallel computing
Parallelizing a sequential algorithm
Strategies
How do I run MPI on a PC/Laptop? MPI
Can I do it on my own PC/laptop? OpenMP installation
Installing MPI
Installing MPI and using Qt
Using "Smaug":"http://comp-phys.net/cluster-info/using-smaug/", the CompPhys computing cluster
What is OpenMP
Getting started, things to remember
OpenMP syntax
Different OpenMP styles of parallelism
General code structure
Parallel region
Hello world, not again, please!
Hello world, yet another variant
Important OpenMP library routines
Private variables
Master region
Parallel for loop
Parallel computations and loops
Scheduling of loop computations
Example code for loop scheduling
Example code for loop scheduling, guided instead of dynamic
More on Parallel for loop
What can happen with this loop?
Inner product
Different threads do different tasks
Single execution
Coordination and synchronization
Data scope
Some remarks
Parallelizing nested for-loops
Nested parallelism
Parallel tasks
Common mistakes
Not all computations are simple
Not all computations are simple, competing threads
How to find the max value using OpenMP
Then deal with the race conditions
What can slow down OpenMP performance?
What can slow down OpenMP performance?
Find the max location for each thread
Combine the values from each thread
"Vector norm with OpenMP":"https://github.com/CompPhysics/ComputationalPhysicsMSU/blob/master/doc/Programs/ParallelizationOpenMP/OpenMPvectornorm.cpp"
"Matrix-matrix multiplication":"https://github.com/CompPhysics/ComputationalPhysicsMSU/blob/master/doc/Programs/ParallelizationOpenMP/OpenMPmatrixmatrixmult.cpp"
What is Message Passing Interface (MPI)?
Going Parallel with MPI
MPI is a library
Bindings to MPI routines
Communicator
Some of the most important MPI functions
"The first MPI C/C++ program":"https://github.com/CompPhysics/ComputationalPhysics2/blob/gh-pages/doc/Programs/LecturePrograms/programs/MPI/chapter07/program2.cpp"
The Fortran program
Note 1
"Ordered output with MPI_Barrier":"https://github.com/CompPhysics/ComputationalPhysics2/blob/gh-pages/doc/Programs/LecturePrograms/programs/MPI/chapter07/program3.cpp"
Note 2
"Ordered output":"https://github.com/CompPhysics/ComputationalPhysics2/blob/gh-pages/doc/Programs/LecturePrograms/programs/MPI/chapter07/program4.cpp"
Note 3
Note 4
"Numerical integration in parallel":"https://github.com/CompPhysics/ComputationalPhysics2/blob/gh-pages/doc/Programs/LecturePrograms/programs/MPI/chapter07/program6.cpp"
Dissection of trapezoidal rule with \( MPI\_Reduce \)
Dissection of trapezoidal rule
Integrating with MPI
How do I use \( MPI\_Reduce \)?
More on \( MPI\_Reduce \)
Dissection of trapezoidal rule
Dissection of trapezoidal rule
OpenMP syntax

* Mostly directives: #pragma omp construct [ clause ...]
* Some functions and types: #include <omp.h>
* Most constructs apply to a block of code, specifically a structured block: entered at the top and exited at the bottom only; calls to exit() and abort() are permitted.