

# PARALLELIZING a CPU CODE to run on GPU



gilles.perrot@univ-fcomte.fr







### Summary

- Sequential code : compilation profiling
- Simple automatic parallelization : openPGI + openACC
- Parallelization with CUDA : from naive to highly optimized
  - Naive reduction run by only one thread.
  - Reduction using shared memory.
  - Hybrid memory usage : registers/shared memory.
  - Intra-warp communication (shuffle instructions).



### CPU code : algorithm

Simple algorithm, based on equation  $\rightarrow$ 

- The outilne of the algorithm is to cut the interval [0; 1] into nsteps slices, considered thin enough.
- We assume that in each slice, the shape under the curve is rectanguar.
- Height is that of the middle point of the slice.
- The greater nsteps, the more accurate the result will be.
- In this case, the computation error is zero when nsteps tends towards infinity.





### An estimation of $\pi$ : the sequential version

Code c++ : pic.cpp

```
int main(int argc, char* argv[])
12
13
14
      long i, nsteps;
15
      double pi, step, sum = 0.0;
      nsteps = 0;
16
17
      if (argc > 1)
18
        nsteps = atol(argv[1]);
      if (nsteps <= 0)
19
        nsteps = 100;
20
21
      step = (1.0)/((double)nsteps);
22
      for (i = 0; i < nsteps; ++i) {</pre>
23
        double x = ((double)i+0.5)*step;
24
        sum += 1.0 / (1.0 + x * x);
25
      }
26
      pi = 4.0 * step * sum;
      cout << std::fixed;</pre>
27
      cout << "pi is " << std::setprecision(17) << pi << "\n";</pre>
28
29
```

**Compilation with gcc (option -Ofast)** \$ q++ -Ofast -o pic pic.cpp



### An estimation of $\pi$ : the sequential version

Performance measurement : Execution with 10<sup>8</sup> and 10<sup>9</sup> intervals



#### Via the Nvidia profiler nvprof

| perrot@cluster2:~/samples-9/0_Simple/pic\$ nvprofcpu-profiling on<br>pi is 3.14159265358997075                                                  | cpu-profiling-mode flat ./pic 1000000000<br>Retrait de première ligne |
|-------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| ======== CPU profiling result (flat):                                                                                                           | Texte                                                                 |
| 100.00% 1.52s main                                                                                                                              | Titre                                                                 |
| <pre>====== Data collected at 100Hz frequency perrot@cluster2:~/samples-9/0_Simple/pic\$ nvprofcpu-profiling on pi is 3.14159265359042639</pre> | Titre2                                                                |
| ======= CPU profiling result (flat):<br>Time(%) Time Name<br>100.00% 149.93ms main                                                              | Titre principal                                                       |



### An estimation of $\pi$ : the sequential version

Compilation with openPGI compiler : pgc++

\$pgc++ -fast -Minfo=all, intensity, ccff -Minline -o pic pic.cpp

The option -Minfo allows to get some useful feedback before trying to parallelize.

Performance measurement... Not all compilations are equivalents.



At least, results are equals with a precision of  $10^{-12}$ .



- OpenACC can generate hybrid executable code (CPU/GPU).
- Automatic generation is driven by a set of compilation directives.
- The compiler : openPGI (pgc++).
- Before trying any parallelization, it is mandatory to:
  - Use the profiler to identify the more time consuming sequences.
  - Find out, among that sequences, those that could be parallelized (on a GPU).
  - · Determine an appropriate set of ACC directives for each code sequence.

In the sample code pic.cpp, the 'for' loop is a candidate for parallelization .

The typical directive to parallelize a C/C++ loop looks like

#pragma acc parallel loop
And has to be placed just above the target loop

| 20 | <pre>#pragma acc parallel loop</pre>         |
|----|----------------------------------------------|
| 21 | <pre>for (i = 0; i &lt; nsteps; ++i) {</pre> |
| 22 | <pre>double x = ((double)i+0.5)*step;</pre>  |
| 23 | sum += 1.0 / (1.0 + x * x);                  |
| 24 | }                                            |



### Compilation

- The -ta option allows to specify a target for the parallel code.
- In our case, we will target a Nvidia Tesla family GPU, Pascal generation, Titan-X model.
- On the host computer, the sdk release version is cuda9.2.

#### Execution





The compiler output shows that one reduction has been identified (on sum) and that compliant code has then been generated.

However, one can write the corresponding directive more explicitely

| 20 | <pre>#pragma acc parallel loop reduction(+:sum)</pre> |
|----|-------------------------------------------------------|
| 21 | <pre>for (i = 0; i &lt; nsteps; ++i) {</pre>          |
| 22 | double $x = ((double)i+0.5)*step;$                    |
| 23 | sum += 1.0 / (1.0 + x * x);                           |
| 24 | }                                                     |

#### Execution



### Speedup achieved :

- x30 against the sequential code and pgc++ compiler
- x18 against the sequential code and g++ compiler WITHOUT ANY EFFORT (ALMOST)



#### A few comments about openACC

- Even common cases can be difficult to solve.
- Managing memory transfers and persistence can be quite challenging.
- Speedups are more difficult to obtain and less impressive.
- A significant overhead is introduced by the compiler, when
  - Transferring data between CPU and GPU,
  - · Calling fonctions / kernels,
  - · Choosing non optimal grid dimensions.
- The official documentation of openACC directives https://www.openacc.org/sites/default/files/inline-files/OpenACC\_2\_0\_specification.pdf



## CUDA / C parallelization

### Outlines

- Nvidia template as a starting point (0\_Simple/template)
- General structure of a CUDA code:
  - 1. Resource allocations (CPU & GPU).
  - 2. Data loading into host's memory (CPU).
  - 3. Data copy from host memory into GPU memory.
  - 4. Computation of grid dimensions and kernel executions.
  - 5. Copy of the result data from GPU memory to host memory.

#### Estimation of $\pi$

- Simplified structure; no input data to transfer.
- Step by step design and successive refinements.
- Solutions to some performance limitation parameters.



### Idea

- One kernel computes the 1/(1+x<sup>2</sup>) value associated to each slice of the [0; 1] interval.
- Each slice value is computed by one thread.
- The number of slices, *nbsteps*, can be great. We need to check the capacities of the GPU regarding the maximum number of concurrent threads.
- For this purpose, one can use the deviceQuery program

| Device 0: "GeForce GTX TITAN X"                                                      | Sha a sh |
|--------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| CUDA Driver Version / Runtime Version<br>CUDA Capability Major/Minor version number: | 9.2 / 9.0<br>5.2                                                                                               |
| Total amount of global memory:                                                       | 12213 MBytes (12806062080 bytes)                                                                               |
| (24) Multiprocessors, (128) CUDA Cores/MP:                                           | 3072 CUDA Cores                                                                                                |
| GPU Max Clock rate:                                                                  | 1076 MHz (1.08 GHz)                                                                                            |
| riMemory Clock rate: des du GPU quant au                                             | 3505 Mhz                                                                                                       |
| Memory Bus Width:                                                                    | 384-bit                                                                                                        |
| L2 Cache Size:                                                                       | 3145728 bytes                                                                                                  |
| Maximum Texture Dimension Size (x,y,z)                                               | 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)                                                           |
| Maximum Layered 1D Texture Size, (num) layers                                        | 1D=(16384), 2048 layers<br>2D=(16384, 16384), 2048 layers                                                      |
| Maximum Layered 2D Texture Size, (num) layers                                        | 65536 bytes                                                                                                    |
| Total amount of shared memory per block:                                             | 49152 bytes                                                                                                    |
| Total number of registers available per block:                                       |                                                                                                                |
| Warp size:                                                                           | 32                                                                                                             |
| Charlinam number of threads per multiprocessor.                                      | conto Categoria Sortie                                                                                         |
| Maximum number of threads per block:                                                 | 1024                                                                                                           |
| Max dimension size of a thread block (x,y,z):                                        |                                                                                                                |
| Max dimension size of a grid size (x,y,z):                                           | (2147483647, 65535, 65535)                                                                                     |

• *nbsteps* max >2.10<sup>9</sup> x 1024 =  $2.10^{12}$ .



### Kernel code

- The number of concurrent thread is a power of 2, multiple of 32.
- If nbsteps is not a power of 2, it should be passed to the kernel as a parameter.
- If *nbsteps* is a power of 2, its value can be guessed by kernels at runtime.

```
global void
99
     testKernel(DTYPE *out data, unsigned int nsteps)
100
101
102
         // thread id inside thread block and inside grid
103
         const unsigned int tidb = threadIdx.x;
         const unsigned int tid = blockIdx.x*blockDim.x + tidb;
104
105
         // perform some computations
106
107
         DTYPE step;
108
         if (tid < nsteps){</pre>
            step = (1.0)/((DTYPE)nsteps);
109
110
           DTYPE x = ((DTYPE)tid+0.5)*step;
111
           out data[tid] = (DTYPE) 1.0 / (1.0 + x * x);
112
          }
113
```



#### Main.cu source code

- Memory allocation on GPU to store *nbsteps* real values.
- Definition of the computing grid dimensions.
- Kernel execution.

| 143        | unsigned int threads_per_block = 1024;                                       |
|------------|------------------------------------------------------------------------------|
| 144        | unsigned int nbsteps = atoi(argv[1]);                                        |
| 145        | unsigned int mem_size = sizeof(DTYPE) * nbsteps;                             |
| 146        |                                                                              |
| 147        | // allocate device memory                                                    |
| 148        | DTYPE *d_vector;                                                             |
| 149        | <pre>checkCudaErrors(cudaMalloc((void **) &amp;d_vector, mem_size));</pre>   |
| 150        |                                                                              |
| 151        | <pre>// setup execution parameters</pre>                                     |
| 152        | <pre>dim3 grid((nbsteps+threads per block-1)/threads per block, 1, 1);</pre> |
|            |                                                                              |
| 153        | dim3 threads(threads_per_block, 1, 1);                                       |
| 153<br>154 |                                                                              |
|            |                                                                              |



#### **Results & performance**

- The final sum is processed on the CPU.
- The data transfer GPU  $\rightarrow$  CPU is highly time consuming (2300 ms).
- Processing time for slice values : # 150 ms.

```
perrot@cluster2:~/samples-9/0_Simple/ecmDemo_etape_1_vector$ ./template 1000000000
./template Starting...
GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2
nbsteps : 1000000000 - Memory size : 4000000000
Memory total : 12806062080 - Memory free : 12566986752
Processing time: 155.380997 (ms)
Pi ref CPU : 3.1415926538
Pi vector GPU : 3.1415926861
```

#### Conclusion

• The final sum must be computed on the GPU to avoid transferring a large amount of data (on only the overall sum has to be copied in this case).



### Idea

- Modifying the kernel code in order to compute the sum of all the slice values.
- As a starting point, one can add a sequence that allows thread 0 to add up the *nsteps* values previously computed by the kernel and stored in global memory.

### Kernel code

TECHNOLOGIES

```
global
43
             void
44
    testKernel(DTYPE *data, DTYPE *d sum, unsigned int nsteps)
45
    Ł
47
        const unsigned int tidb = threadIdx.x;
        const unsigned int tid = blockIdx.x*blockDim.x + tidb;
        // perform some computations
49
50
        DTYPE step;
        DTYPE sum = 0.0:
51
52
        if (tid < nsteps){</pre>
           step = (1.0)/((DTYPE)nsteps);
53
           DTYPE x = ((DTYPE)tid+0.5)*step;
54
55
           data[tid] = (DTYPE)1.0 / (1.0 + x*x);
57
           syncthreads();
59
        if(tid == 0){
           for(unsigned int i =0; i<nsteps; i++){</pre>
61
             sum += data[i];
62
63
           *d sum = sum;
64
65
```

#### **Results & performance**

- The sum is computed on the GPU.
- No more large vector to transfer GPU  $\rightarrow$  CPU.
- Computing time > 1 minute for 10<sup>9</sup> slices !!!



#### Conclusion

- Only one thread is computing the sum (tid #0, no parallelism).
- Thread #0 waits for all the other threads to complete their tasks before beginning to add up values (\_\_\_\_\_\_\_syncthreads()). It is mandatory to avoid adding up wrong values.
- The slice values needed to compute the sum are first stored in global memory, then they are read from global memory by thread #0, one by one.
- Trying to have a GPU to work like a CPU always lead to a disaster.



### Idea

- Modifying the kernel in order to add up the values more cleverly (with parallelism).
- Storing the values to add up into shared memory, faster than global memory:
  - 48kB of memory shared by thread block are available.
  - Our sum requires 1 double precision value by thread, ie 8 Bytes/thread.
  - A max number of 1024 threads can be define for each block, ie 8 kB/block ⇒ OK (< 48kB).</li>
  - ✓ No possible direct inter-block communication  $\Rightarrow$  partial sums in global memory.
- Two options :
  - 1. Reducing the overall global memory amount required + processing the first summing stage while computing the slice values.
  - 2. Storing all the slice values into global memory and summing afterwards.
- Option #1 is a bit more efficient but less common  $\Rightarrow$  we choose option #2.
- We try to write a scalable kernel, ie. that can be launched with various scales of grid dimensions.

8 blocks 1 block

#### log(n) summing algorithm

- Sums are processed inside each thread block (threads\_per\_bloc).
- Each block sum is then stored into global memory at the very index of the block inside the grid.
- Each kernel execution would divide the size of the vector by threads\_per\_bloc.
- Before each kernel call, suited grid dimensions have to be computed.





TECHNOLOGIES

#### kernel code (summup<<<>>>)

```
62
      global void
63
    summup(DTYPE * data, unsigned int nvalues)
64
    {
65
        const unsigned int tidb = threadIdx.x;
        const unsigned int tid = blockIdx.x*blockDim.x + tidb;
67
        extern shared DTYPE sdata[];
70
        // mandatory if nvalues is not a power of 2
        if (tid < nvalues)</pre>
71
72
           sdata[tidb] = data[tid];
73
        else
           sdata[tidb] = 0.0;
75
          syncthreads();
76
        // loop until all values have been summed up
77
        for (int stride = 1; stride < blockDim.x; stride *=2) {</pre>
78
             if(tidb %(2*stride) == 0)
79
               sdata[tidb] += sdata[tidb + stride];
              syncthreads();
81
82
        // writes the sum of the bloc
83
        if(tidb == 0)
          data[blockIdx.x] = sdata[tidb];
85
86
```

#### Kernel calls summup<<<>>>

| 140 | computeNstore<<< grid, threads, 0 >>> (d_vector, nbsteps);                |
|-----|---------------------------------------------------------------------------|
| 141 | <pre>printf("computeNstore terminated \n");</pre>                         |
| 142 |                                                                           |
| 143 | <pre>int nvalues = nbsteps;</pre>                                         |
| 144 | int nblocks = (nvalues+threads_per_block-1)/threads_per_block;            |
| 145 | <pre>while(nvalues &gt; 1 ){</pre>                                        |
| 146 | printf("vector : %d values - %d block(s) of %d threads\n", nvalues,       |
| ٠   | <pre>nblocks, threads_per_block);</pre>                                   |
| 147 | summup<<< nblocks, threads_per_block, sizeof(DTYPE)*threads_per_block >>> |
| ٠   | (d_vector, nvalues);                                                      |
| 148 | nvalues = nblocks;                                                        |
| 149 | <pre>nblocks = (nvalues+threads_per_block-1)/threads_per_block;</pre>     |
| 150 | }                                                                         |



#### **Results & performance**

- The final sum is stored at index #0 of d\_vector.
- Overall computing time (memcpy included) # 0,40s for 10<sup>9</sup> slices (block size =1024)
- Overall computing time (memcpy included) # 0,27s for 10<sup>9</sup> slices (block size =128)



#### SCIENCES & TECHNOLOGIES

#### Computing the sum inside thread blocks

• No more divergent warps, neither modulo operators







### **Performances summary**

- Estimation of  $\pi$  over 10<sup>9</sup> slices;
- Double precision real computation;
- Memory throughput (D)
  - maximum Titan-X : bus width x clock frequency/2
  - Dmax = 384 x 3505/2 8 = **336 GB/s**
  - D = nbsteps x 8 Bytes / duration

| Version     | Duration | Throughput |
|-------------|----------|------------|
| CPU gcc     | 1440 ms  | 5.5 GB/s   |
| CPU openPGI | 2358 ms  | 3.4 GB/s   |
| GPU openPGI | 270 ms   | 30 GB/s    |
| GPU CUDA v3 | 270 ms   | 30 GB/s    |
| GPU CUDA v4 | 220 ms   | 36.4 GB/s  |



#### Bank conflicts in shared memory

- The shared memory is implemented through 32 4-Byte-width banks.
- Two threads belonging to two different half-warps (#O-15 et #16-31) accessing two different datas of the same bank will result in a bank conflict.

8

-1

-1

-2

-2

0

-2

3

5

5

-2

-5

-3

-3

2

7

7

0

11

11

11

0

2

2

2

During reading stage for stride=1, each thread reads 2 doubles (16 Bytes).

10

1

- Each double spreds out on 2 banks.
- 4-ways bank conflicts cannot be avoided.





Shared memory bank indexes



#### Shared memory bank conflicts

- At writing stage for stride=1, each thread writes 1 double (8 Bytes).
- There's an unused memory space between each written value.
- 2-way bank conflict if nothing is done.



Thread indexes in the warp





#### Shared memory bank conflicts

- For threads #8-15, #24-31, etc. we add an offset of 1 to the memory address.
- New indexe computation
  - $\cdot$  shid = tidb + (tidb>>3)&0x1



Indices des threads du warp





#### Shared memory bank conflicts

- For threads #8-15, #24-31, etc. we add an offset of 1 to the memory address. •
- New indexe computation •
  - $\checkmark$  shid = tidb + (tidb>>3)&0x1



Thread indexes inside the warp



Shared memory bank indexes



#### Shared memory bank conflicts

- When computing in single precision, its relevant to address the shared memory in a contiguous way ⇒ no bank conflicts.
- In our case, it would not bring additional performance (v6).





### **Performance summary**

- Estimation of  $\pi$  over 10<sup>9</sup> slices;
- Double precision real computation;
- Memory throughput (D)
  - maximum Titan-X : bus width x clock frequency/2
  - Dmax = 384 x 3505/2 8 = **336 GB/s**
  - D = nbsteps x 8 Bytes / duration

| Version     | Duration | Throughput |
|-------------|----------|------------|
| CPU gcc     | 1440 ms  | 5.5 GB/s   |
| CPU openPGI | 2358 ms  | 3.4 GB/s   |
| GPU openPGI | 270 ms   | 30 GB/s    |
| GPU CUDA v3 | 270 ms   | 30 GB/s    |
| GPU CUDA v4 | 220 ms   | 36.4 GB/s  |
| GPU CUDA v5 | 200 ms   | 40,0 GB/s  |



#### Idea

- With kernel v5, lot of threads are idle.
- Half of all the threads are idle as soon as the very first iteration.
- Combining the first addition with the data loading into global memory.
- Caution if *nbsteps* is not a multiple of 2.
- Padding the vector size to the next power of 2 (nextPow2() function).

#### Modified kernel code

| 67<br>68<br>69       | <pre>const unsigned int tidb = threadIdx.x;<br/>const unsigned int tid = blockIdx.x*blockDim.x + tidb;<br/>const unsigned int offset = nvalues&gt;&gt;1;</pre> |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 70<br>71<br>72<br>73 | <pre>externshared DTYPE sdata[]; if (tid &lt; offset)</pre>                                                                                                    |
| 74                   | <pre>sdata[SHID(tidb)] = data[tid] + data[tid + offset];</pre>                                                                                                 |
| 75<br>76             | sdata[SHID(tidb)] = 0.0;                                                                                                                                       |



#### Modified kernel calls

| 148 | int nvalues = nbsteps:                                                                 |
|-----|----------------------------------------------------------------------------------------|
| 149 | <pre>nblocks = (nvalues/2+threads_per_block-1)/(threads_per_block);</pre>              |
| 150 | <pre>while(nvalues &gt; 1 ){</pre>                                                     |
| 151 | printf("vector : %d values - %d block(s) of %d threads\n", nvalues,                    |
| •   | <pre>nblocks, threads_per_block);</pre>                                                |
| 152 | <pre>summup&lt;&lt;&lt; nblocks, threads_per_block,</pre>                              |
| ٠   | <pre>sizeof(DTYPE)*(threads_per_block+(threads_per_block&gt;&gt;3)) &gt;&gt;&gt;</pre> |
| •   | (d_vector, nvalues);                                                                   |
| 153 | nvalues = nblocks;                                                                     |
| 154 | <pre>nblocks = (nvalues/2+threads_per_block-1)/threads_per_block;</pre>                |
| 155 | }                                                                                      |



### **Performance summary**

- Estimation of  $\pi$  over 10<sup>9</sup> slices;
- Double precision real computation;
- Memory throughput (D)
  - maximum Titan-X : bus width x clock frequency/2
  - Dmax = 384 x 3505/2 8 = **336 GB/s**
  - D = nbsteps x 8 Bytes / duration

| Version     | Duration | Throughput |
|-------------|----------|------------|
| CPU gcc     | 1440 ms  | 5.5 GB/s   |
| CPU openPGI | 2358 ms  | 3.4 GB/s   |
| GPU openPGI | 270 ms   | 30 GB/s    |
| GPU CUDA v3 | 270 ms   | 30 GB/s    |
| GPU CUDA v4 | 220 ms   | 36.4 GB/s  |
| GPU CUDA v5 | 200 ms   | 40,0 GB/s  |
| GPU CUDA v7 | 190 ms   | 42,1 GB/s  |



### Idea

- Iterations after iteration, less and less threads are active.
- When stride  $\leq$  32, only one warp is still active.
- Inside a warp, as all instructions are executed synchronously
  - ⇒\_syncthreads() is now useless
  - ⇒ if(tidb < stride) is now useless

### Modified kernel code

| 89 | for (int stride = blockDim.x/2; stride >32 stride >>=1) {   |
|----|-------------------------------------------------------------|
| 90 | if(tidb < stride)                                           |
| 91 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb + stride)];</pre> |
| 92 | syncthreads();                                              |
| 93 |                                                             |
| 94 | if (tidb < 32) warpReduce(sdata, tidb);                     |

| 64 | device void warpReduce(volatile DTYPE * sdata, int tidb){ |
|----|-----------------------------------------------------------|
| 65 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb +32)];</pre>    |
| 66 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb +16)];</pre>    |
| 67 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb + 8)];</pre>    |
| 68 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb + 4)];</pre>    |
| 69 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb + 2)];</pre>    |
| 70 | <pre>sdata[SHID(tidb)] += sdata[SHID(tidb + 1)];</pre>    |
| 71 | }                                                         |



### **Performance summary**

- Estimation of  $\pi$  over 10<sup>9</sup> slices;
- Double precision real computation;
- Memory throughput (D)
  - maximum Titan-X : bus width x clock frequency/2
  - Dmax = 384 x 3505/2 8 = **336 GB/s**
  - D = nbsteps x 8 Bytes / duration

| Version        | Duration | Throughput |
|----------------|----------|------------|
| CPU gcc        | 1440 ms  | 5.5 GB/s   |
| CPU openPGI    | 2358 ms  | 3.4 GB/s   |
| GPU openPGI    | 270 ms   | 30 GB/s    |
| GPU CUDA v3    | 270 ms   | 30 GB/s    |
| GPU CUDA v4    | 220 ms   | 36.4 GB/s  |
| GPU CUDA v5    | 200 ms   | 40,0 GB/s  |
| GPU CUDA v7/v8 | 190 ms   | 42,1 GB/s  |



#### Idea

• Optimising the kernel <<<computeNstore>>> by using specific instructions.

| 97  | <pre>computeNstore(DTYPE *out_data, unsigned int nsteps, DTYPE step)</pre> |
|-----|----------------------------------------------------------------------------|
| 98  | {                                                                          |
| 99  | <pre>const unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;</pre>   |
| 100 |                                                                            |
| 101 | if (tid < nsteps)                                                          |
| 102 | <pre>out_data[tid] =drcp_rn(1.0 +dmul_rn(threadIdx.x+0.5,step));</pre>     |
| 103 | }                                                                          |

- Profiler  $\Rightarrow$  93 ms with 512 threads/block (almost no impact).
- The fastest memory on a GPU chip is represented by its registers.
- GPU registers:
  - Limited amount : 255 per thread, 64K x 4Bytes max per block.
  - Maximal speed : no latency, no bank conflicts.
  - ✓ No possible indexing  $\Rightarrow$  no loops on array elements.
- Intra-warp register instructions:
  - \_\_\_\_shfl\_\_sync,
  - \_\_\_\_shfl\_up\_sync,
  - \_\_\_\_\_shfl\_down\_sync,
  - \_\_\_\_\_shfl\_\_xor\_\_sync



#### Warp level instructions Example 1

```
global__ void warpReduce_demo() {
 97
          int laneId = threadIdx.x & 0x1f;
          // Seed starting value as inverse lane ID
 98
          int value = 31 - laneId;
 99
100
101
          int i=16;
          //for (int i=warpSize/2; i>0; i/=2)
102
103
              value += shfl down sync(0xffffffff, value, i);
104
          // "value" now contains the sum across all threads
105
          printf("Thread %d final value = %d\n", threadIdx.x, value);
107
      }
108
109
     int main(int argc, char **argv)
110
111
       warpReduce demo<<< 1, 32 >>>();
112
       cudaDeviceSynchronize();
113
       return 0;
114
```



#### Warp level instructions Example1

tid+offset<32 ⇒ returns tid+offset

tid+offset >31 ⇒ returns tid

|          | Thread 0 final value = 31  |                                    | 0   | final value = 46   |
|----------|----------------------------|------------------------------------|-----|--------------------|
|          | Thread 1 final value = 30  |                                    | 1   | final value = 44   |
|          | Thread 2 final value = 29  | Thread                             | 12  | final value = 42   |
|          | Thread 3 final value = 28  |                                    | 13  | final value = 40   |
| ~        | Thread 4 final value = 27  |                                    | 4   | final value = 38   |
| 2        | Thread 5 final value = 26  |                                    | 15  | final value = 36   |
|          | Thread 6 final value = 25  | Thread                             | 6 1 | final value = 34   |
|          | Thread 7 final value = 24  | ////////////////////////////////// | 17  | final value = 32   |
|          | Thread 8 final value = 23  | /// 🔁 Thread                       | 8 1 | final value = 30   |
|          | Thread 9 final value = 22  |                                    |     | final value = 28   |
|          | Thread 10 final value = 21 |                                    |     | ) final value = 26 |
|          | Thread 11 final value = 20 |                                    |     | l final value = 24 |
|          | Thread 12 final value = 19 |                                    |     | 2 final value = 22 |
|          | Thread 13 final value = 18 |                                    |     | 3 final value = 20 |
|          | Thread 14 final value = 17 |                                    |     | final value = 18   |
|          | Thread 15 final value = 16 |                                    |     | 5 final value = 16 |
|          | Thread 16 final value = 15 |                                    |     | 6 final value = 30 |
|          | Thread 17 final value = 14 |                                    |     | / final value = 28 |
|          | Thread 18 final value = 13 |                                    |     | 3 final value = 26 |
|          | Thread 19 final value = 12 |                                    |     | ) final value = 24 |
|          | Thread 20 final value = 11 |                                    |     | ) final value = 22 |
|          | Thread 21 final value = 10 |                                    |     | l final value = 20 |
| 31       | Thread 22 final value = 9  |                                    |     | final value = 18   |
| <u> </u> | Thread 23 final value = 8  |                                    |     | 3 final value = 16 |
|          | Thread 24 final value = 7  |                                    |     | final value = 14   |
|          | Thread 25 final value = 6  |                                    |     | 5 final value = 12 |
|          | Thread 26 final value = 5  |                                    |     | i final value = 10 |
|          | Thread 27 final value = 4  |                                    |     | final value = 8    |
|          | Thread 28 final value = 3  |                                    |     | 3 final value = 6  |
|          | Thread 29 final value = 2  |                                    |     | ) final value = 4  |
|          | Thread 30 final value = 1  |                                    |     | ) final value = 2  |
|          | Thread 31 final value = 0  | Thread                             | 131 | l final value = 0  |



#### Warp level instructions

Example 1 : sum over one warp







#### Warp level instructions

Example 2 : sum over one block

\_device\_\_ function processing the sum over one warp

| 44 | inlinedevice                                                         |
|----|----------------------------------------------------------------------|
| 45 | <pre>DTYPE warpReduceSum(DTYPE val) {</pre>                          |
| 46 | <pre>for (int offset = warpSize/2; offset &gt; 0; offset /= 2)</pre> |
| 47 | <pre>val +=shfl_down_sync(0xffffffff, val, offset);</pre>            |
| 48 | return val;                                                          |
| 49 | }                                                                    |

Each thread calling this function would execute the \_\_shfl\_down\_sync() instructions and then would return its own value stored in val.



FCHNOLOGIES

#### Warp level instructions Example 2 : sum over one block device function processing the sum over one block 54 inline device DTYPE blockReduceSum(DTYPE val) { 55 To store warp sums 56 (max 32 in each block). static shared DTYPE sdata[32]; 57 int lane = threadIdx.x % warpSize; 58 Warp id in the block =wid, int wid = threadIdx.x / warpSize; 59 Thread id in the warp =lane. 61 val = threadIdx.x; Just for demo, one puts values \_\_\_\_\_syncthreads(); 62 63 in val. val = warpReduceSum(val); 64 Sums inside warps. 65 if (lane==0) sdata[wid]=val; Sum is held by thread #0 of the 67 syncthreads(); warp (lane #0). 69 val = (threadIdx.x < blockDim.x / warpSize) ? sdata[lane] : 0;</pre> 70 71 Set-upvalues for second stage. if (wid==0) 72 If blockDim <1024, 0-padding. val = warpReduceSum(val); 73 75 return val; Warp sum lane #0 = block sum 76

#### Warp level instructions

Example 2 : sum over one block

Kernel processing the sum over one block

| 110 | global void blockReduce_demo() {                                     |
|-----|----------------------------------------------------------------------|
| 111 | <pre>int sum = 0;</pre> Kernel calling the function.                 |
| 112 | <pre>sum = blockReduceSum(sum);</pre>                                |
| 113 | <pre>if (threadIdx.x==0)</pre>                                       |
| 114 | <pre>printf("Thread %d final value = %d\n", threadIdx.x, sum);</pre> |
| 115 | }                                                                    |
| 116 |                                                                      |
| 117 | int main(int argc, char **argv)                                      |
| 118 | {                                                                    |
| 119 | blockReduce_demo<<< 1, 1024 >>>();                                   |
| 120 | <pre>cudaDeviceSynchronize();</pre>                                  |
| 121 |                                                                      |
| 122 | } block of 1024 threads.                                             |

Résultat

perrot@cluster2:~/samples-9/0\_Simple/ecmDemo\_shuffle\$ ./template
Thread 0 final value = 523776

Correct result.



#### Warp level instructions

Processing the sum over a vector of values



La somme globale est obtenue en deux temps

| 82 | <pre>int threads = 512;</pre>                                                         |
|----|---------------------------------------------------------------------------------------|
| 83 | int blocks = min((N + threads - 1) / threads, 1024); - Max 1024 blocks                |
| 84 | deviceReduceKernel<< <blocks, threads="">&gt;&gt;(in, out, N); ← Block sums</blocks,> |
| 85 | deviceReduceKernel<<<1, 1024>>>(out, out, blocks); 🗕 Sum of sums                      |



# Warp level instructions Estimation of $\pi$



./template Starting...

```
GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2
nbsteps : 1073741824 = 10^9 - Memory size : 8589934592
Memory total : 12806062080 - Memory free : 12566986752
Processing time: 118.502998 (ms)
Pi GPU : 3.1415926536
Pi ref CPU : 3.1415926536
TEST PASSED.... (err 0.000189)x10^-9
```



### **Performance summary**

- Estimation of  $\pi$  over 10<sup>9</sup> slices;
- Double precision real computation;
- Memory throughput (D)
  - maximum Titan-X : bus width x clock frequency/2
  - Dmax = 384 x 3505/2 8 = **336 GB/s**
  - D = nbsteps x 8 Bytes / duration

| Version        | Duration      | Throughput |
|----------------|---------------|------------|
| CPU gcc        | 1440 ms       | 5.5 Go/s   |
| CPU openPGI    | 2358 ms       | 3.4 Go/s   |
| GPU openPGI    | 270 ms        | 30 Go/s    |
| GPU CUDA v3    | 270 ms        | 30 Go/s    |
| GPU CUDA v4    | 220 ms        | 36.4 Go/s  |
| GPU CUDA v5    | 200 ms        | 40,0 Go/s  |
| GPU CUDA v7/v8 | 190 ms        | 42,1 Go/s  |
| GPU CUDA v9    | <b>118 ms</b> | 67,8 Go/s  |



### Connecting to IUTBM cluster nodes

• ssh login@cluster2 (Or cluster1 Or cluster3)

If needed, first log in on slayer with

• ssh login@slayer.iut-bm.univ-fcomte.fr

For more convenience, edit your ~/.ssh/config and add some useful things:  ${\tt Host}$  \*

```
ForwardAgent yes
ForwardX11 no
ForwardX11Trusted yes
Port 22
Protocol 2
ServerAliveInterval 60
ServerAliveCountMax 30
Host cluster2
Hostname cluster2
User login
ProxyCommand ssh -W %h:%p login@slayer
```

