Reduction is a class of parallel algorithms: it combines all the elements in a collection, turning O(N) input data into an O(1) result. Any associative (and typically commutative) operation can be used, for example sum, maximum, or minimum. Reductions serve as an important primitive or subroutine in many larger operations.
In a sequential algorithm, summing an array could be implemented with the following steps.
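As a baseline, here is a minimal C++ sketch of the sequential sum: a single accumulator walks the whole array in O(N) steps.

```cpp
#include <vector>

// Sequential reduction: one accumulator visits every element in order.
long long sequentialSum(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data) {
        sum += x;  // apply the binary operation (here: addition) element by element
    }
    return sum;
}
```

Every addition depends on the previous one, so nothing here can run in parallel; this is exactly the dependency chain that the reduction tree breaks.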
Reduction gives us another approach to this situation. To understand the algorithms, you should have basic knowledge of parallel programming and of how threads and blocks are organized. I suggest referring to sources such as Professional CUDA C Programming, the CUDA Handbook, or my earlier posts.
In parallel programming, we perform the addition with the following method:
1. Partition the input data into smaller chunks.
2. Each thread calculates the partial sum of its chunk.
3. Add the partial sums from all chunks into the final sum.
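The three steps above can be sketched in host C++, with `std::thread` standing in for CUDA threads; this is an illustration of the pattern only, not the actual kernel, and the function name is mine.

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Host-side sketch of the three steps: partition, per-thread partial sums,
// then a final combine. One std::thread plays the role of one CUDA thread.
long long parallelSum(const std::vector<int>& data, int numChunks) {
    std::vector<long long> partial(numChunks, 0);
    std::vector<std::thread> workers;
    size_t chunkSize = (data.size() + numChunks - 1) / numChunks;  // ceiling division

    // Steps 1 and 2: each worker sums its own chunk into partial[i].
    for (int i = 0; i < numChunks; ++i) {
        workers.emplace_back([&, i] {
            size_t begin = i * chunkSize;
            size_t end = std::min(begin + chunkSize, data.size());
            for (size_t j = begin; j < end; ++j) partial[i] += data[j];
        });
    }
    for (auto& t : workers) t.join();

    // Step 3: combine the partial sums into the final result.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each worker writes only its own slot of `partial`, so no synchronization is needed until the join; on a GPU the same separation is achieved by giving each block its own output element.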
Take a look at the following to get a better understanding.
Here is an example of the interleaved-pair approach. From the parallel-programming point of view, the threads are organized and the steps proceed as shown below. This approach avoids the warp divergence that causes poor kernel performance.
To implement this algorithm, we need two global memory arrays: one holding the entire array to reduce, and a smaller one holding the partial sum of each thread block. The figure above shows the operation within each thread block.
Imagine your input data split across many blocks. For instance, with 1000 input elements and 32 threads per block, we need ceil(1000/32) = 32 blocks, the last of which is only partially filled. Each thread block operates on its portion of the array independently.
In the final step, we combine the partial results of all thread blocks into the final sum.
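This final step is small enough to run on the host after copying the per-block partial sums back. A C++ sketch (function names are mine for illustration):

```cpp
#include <vector>

// Grid-size arithmetic: ceiling division gives the number of blocks needed,
// e.g. 1000 elements at 32 threads per block -> 32 blocks, the last one
// only partially filled.
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Final step: sum the one-per-block partial results (copied back from the
// device) into the final answer.
long long combinePartialSums(const std::vector<long long>& partialSums) {
    long long total = 0;
    for (long long p : partialSums) total += p;
    return total;
}
```

Since the partial-sum array has only one entry per block, this host loop is short; alternatively, a second kernel launch can reduce it on the device.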