Performance was poor because we insisted on keeping track of the maxval and its location during the execution of the loop.
Idea: Have each thread find the maxloc in its own data, then combine the results. Use temporary arrays indexed by thread number to hold the values found by each thread.
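
The idea above might be sketched as follows in C with OpenMP. This is a minimal illustration, not the original code: the function name maxloc_parallel and the array names tmax and tloc are hypothetical, and the choice to return the first index on ties is an assumption. Each thread scans its share of the loop into private variables, stores its result in the slot indexed by its thread number, and a short serial pass combines the slots afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the index of the first maximum of a[0..n-1]
 * and, if maxval is non-NULL, stores the maximum value in *maxval. */
long maxloc_parallel(const double *a, long n, double *maxval)
{
    assert(a != NULL && n > 0);

    int max_threads = 1;            /* serial fallback when OpenMP is off */
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif

    /* Temporary arrays indexed by thread number. */
    double *tmax = malloc((size_t)max_threads * sizeof *tmax);
    long   *tloc = malloc((size_t)max_threads * sizeof *tloc);
    assert(tmax != NULL && tloc != NULL);
    for (int t = 0; t < max_threads; t++)
        tloc[t] = -1;               /* -1 marks a slot that found nothing */

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Private running maximum: no synchronization inside the loop. */
        double my_max = 0.0;
        long   my_loc = -1;

        #pragma omp for schedule(static) nowait
        for (long i = 0; i < n; i++) {
            if (my_loc < 0 || a[i] > my_max) {
                my_max = a[i];
                my_loc = i;
            }
        }

        /* One write per thread into its own slot. */
        tmax[tid] = my_max;
        tloc[tid] = my_loc;
    }

    /* Serial combine over the per-thread results. */
    double best = 0.0;
    long   loc  = -1;
    for (int t = 0; t < max_threads; t++) {
        if (tloc[t] < 0)
            continue;               /* this thread had no iterations */
        if (loc < 0 || tmax[t] > best ||
            (tmax[t] == best && tloc[t] < loc)) {
            best = tmax[t];
            loc  = tloc[t];
        }
    }

    free(tmax);
    free(tloc);
    if (maxval != NULL)
        *maxval = best;
    return loc;
}
```

Keeping the running maximum in private variables and writing to the shared arrays only once per thread avoids both synchronization inside the loop and repeated writes to neighboring array elements. The combine step costs one comparison per thread, which is negligible next to the scan over n elements.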