Reduction is a class of parallel algorithms: it combines all the elements in a collection, turning O(N) input data into an O(1) result. Any associative (and typically commutative) operation can be used, for example sum, maximum, or minimum. Reductions serve as an important primitive or subroutine in many larger operations.
In a sequential algorithm, summing an array could be implemented with the following steps.
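As a baseline, here is a minimal C++ sketch of the sequential sum: a single accumulator walks the whole array in O(N) steps.

```cpp
#include <vector>

// Sequential reduction: one accumulator visits every element in order.
long long sequentialSum(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data) {
        sum += x;  // apply the binary operation (here: addition) element by element
    }
    return sum;
}
```

Every addition depends on the previous one, so nothing here can run in parallel; this is exactly the dependency chain that the reduction tree breaks.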
Reduction gives us another approach to this situation. To understand the algorithms, you should have basic knowledge of parallel programming and of how threads and blocks are organized. I suggest referring to sources such as Professional CUDA C Programming, the CUDA Handbook, or my earlier posts.
In parallel programming, we perform the addition with the following method:
1. Partition the input data into smaller chunks.
2. Each thread calculates the partial sum of its chunk.
3. Add the partial sums from all chunks into the final sum.
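The three steps above can be sketched in host C++, with `std::thread` standing in for CUDA threads; this is an illustration of the pattern only, not the actual kernel, and the function name is mine.

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Host-side sketch of the three steps: partition, per-thread partial sums,
// then a final combine. One std::thread plays the role of one CUDA thread.
long long parallelSum(const std::vector<int>& data, int numChunks) {
    std::vector<long long> partial(numChunks, 0);
    std::vector<std::thread> workers;
    size_t chunkSize = (data.size() + numChunks - 1) / numChunks;  // ceiling division

    // Steps 1 and 2: each worker sums its own chunk into partial[i].
    for (int i = 0; i < numChunks; ++i) {
        workers.emplace_back([&, i] {
            size_t begin = i * chunkSize;
            size_t end = std::min(begin + chunkSize, data.size());
            for (size_t j = begin; j < end; ++j) partial[i] += data[j];
        });
    }
    for (auto& t : workers) t.join();

    // Step 3: combine the partial sums into the final result.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each worker writes only its own slot of `partial`, so no synchronization is needed until the join; on a GPU the same separation is achieved by giving each block its own output element.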
Take a look at the following to get a better understanding.
Here is an example of the interleaved-pair approach. From the parallel-programming point of view, the threads are organized and the steps proceed as shown below. This approach avoids the warp divergence that causes poor kernel performance.
To implement this algorithm, we need two global memory arrays: one holding the entire array to reduce, and a smaller one holding the partial sum of each thread block. The figure above shows the operation within each thread block.
Imagine your input data split across many blocks. For instance, with 1000 input elements and 32 threads per block, we need ceil(1000/32) = 32 blocks, the last of which is only partially filled. Each thread block operates on its portion of the array independently.
In the final step, we combine the partial results of all thread blocks into the final sum.
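This final step is small enough to run on the host after copying the per-block partial sums back. A C++ sketch (function names are mine for illustration):

```cpp
#include <vector>

// Grid-size arithmetic: ceiling division gives the number of blocks needed,
// e.g. 1000 elements at 32 threads per block -> 32 blocks, the last one
// only partially filled.
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Final step: sum the one-per-block partial results (copied back from the
// device) into the final answer.
long long combinePartialSums(const std::vector<long long>& partialSums) {
    long long total = 0;
    for (long long p : partialSums) total += p;
    return total;
}
```

Since the partial-sum array has only one entry per block, this host loop is short; alternatively, a second kernel launch can reduce it on the device.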