Kernels for performing the Reduce operation.
Functions
kernel void | reduce_min_f (global float4 *in, global float *out, local float *data, uint n) | Performs a reduce operation on the columns of an array.
kernel void | reduce_max_ui (global uint4 *in, global uint *out, local uint *data, uint n) | Performs a reduce operation on the columns of an array.
kernel void | reduce_sum_f (global float4 *in, global float *out, local float *data, uint n) | Performs a reduce operation on the columns of an array.
kernel void reduce_max_ui (global uint4 *in, global uint *out, local uint *data, uint n)
Performs a reduce operation on the columns of an array.
Computes the maximum element of each row of an array.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as uint4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 uint (= 2 uint4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
[...] 0, in the output array, since in the next phase the data are going to be handled as uint4.
[in] | in | input array of uint elements.
[out] | out | (reduced) output array of uint elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows*sizeof(uint) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim*rows*sizeof(uint) \).
[in] | data | local buffer. Its size should be 2 uint elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof(uint) \).
[in] | n | number of elements in a row of the array divided by 4.
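The workspace and buffer-size rules above can be sketched as a small host-side helper in plain C. This is only an illustration under stated assumptions: the `ReduceLaunch` struct and the `reduce_launch` function are hypothetical names, not part of the documented kernels. Rounding \( gXdim \) up to a multiple of \( lXdim \) is an added assumption that reflects the OpenCL 1.x requirement that the global size be divisible by the local size. Reading \( wgXdim \) as the number of work-groups along x is likewise an interpretation of the parameter table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the launch configuration (not part of the kernels' API). */
typedef struct {
    size_t global[2];    /* {gXdim, gYdim} */
    size_t local[2];     /* {lXdim, 1} */
    size_t wg_per_row;   /* wgXdim, assumed to be the number of work-groups along x */
    unsigned int n;      /* kernel argument n = N / 4 */
    size_t local_bytes;  /* size of the `data` buffer: 2 * lXdim * elem_size */
    size_t out_bytes;    /* size of the `out` buffer: wgXdim * rows * elem_size */
} ReduceLaunch;

/* Returns non-zero when v is a power of 2. */
int is_pow2(size_t v) { return v != 0 && (v & (v - 1)) == 0; }

/* N: elements per row (multiple of 4), M: rows, lXdim: work-group size in x,
 * elem_size: size in bytes of one element (4 for uint and float). */
ReduceLaunch reduce_launch(size_t N, size_t M, size_t lXdim, size_t elem_size)
{
    assert(N > 0 && N % 4 == 0);
    assert(M > 0);
    assert(is_pow2(lXdim));

    ReduceLaunch cfg;
    size_t items  = (N + 7) / 8;                  /* ceil(N / 8): each work-item reads 8 elements */
    size_t groups = (items + lXdim - 1) / lXdim;  /* assumption: round up to whole work-groups */

    cfg.global[0]   = groups * lXdim;             /* gXdim >= N / 8 */
    cfg.global[1]   = M;                          /* gYdim = M */
    cfg.local[0]    = lXdim;
    cfg.local[1]    = 1;
    cfg.wg_per_row  = groups;
    cfg.n           = (unsigned int)(N / 4);
    cfg.local_bytes = 2 * lXdim * elem_size;
    cfg.out_bytes   = groups * M * elem_size;
    return cfg;
}
```

For example, a 3-row array with 1024 elements per row and a work-group size of 64 would give a global workspace of 128 x 3, two work-groups per row, and `n = 256`.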
kernel void reduce_min_f (global float4 *in, global float *out, local float *data, uint n)
Performs a reduce operation on the columns of an array.
Computes the minimum element of each row of an array.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
[...] INFINITY, in the output array, since in the next phase the data are going to be handled as float4.
[in] | in | input array of float elements.
[out] | out | (reduced) output array of float elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows*sizeof(float) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim*rows*sizeof(float) \).
[in] | data | local buffer. Its size should be 2 float elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof(float) \).
[in] | n | number of elements in a row of the array divided by 4.
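When validating the kernel's results, a host-side reference can be handy. The following plain C sketch computes the per-row minimum on the CPU. The function name `reference_row_min` and the row-major input layout are assumptions made for illustration and are not taken from the kernel source. The accumulator starts at INFINITY, the identity of the minimum, which is also why elements equal to INFINITY never change a row's result.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper): out[r] receives the minimum of
 * the N elements of row r. The input is assumed to be stored row-major. */
void reference_row_min(const float *in, float *out, size_t rows, size_t N)
{
    assert(in != NULL && out != NULL);
    for (size_t r = 0; r < rows; ++r) {
        float m = INFINITY;              /* identity of min */
        for (size_t c = 0; c < N; ++c) {
            float v = in[r * N + c];
            if (v < m)
                m = v;
        }
        out[r] = m;
    }
}
```

Comparing the values in `out` against the kernel's final results for each row gives a simple correctness check. Exact equality is appropriate here because a minimum involves no rounding.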
kernel void reduce_sum_f (global float4 *in, global float *out, local float *data, uint n)
Performs a reduce operation on the columns of an array.
Computes the sum of the elements of each row of an array.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
[...] 0.f, in the output array, since in the next phase the data are going to be handled as float4.
[in] | in | input array of float elements.
[out] | out | (reduced) output array of float elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows*sizeof(float) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim*rows*sizeof(float) \).
[in] | data | local buffer. Its size should be 2 float elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof(float) \).
[in] | n | number of elements in a row of the array divided by 4.
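The difference between the single work-group and multiple work-group cases described for `out` can be illustrated with a CPU emulation in plain C. This sketch makes several assumptions that are not confirmed by the documentation: the function name `block_sums` is hypothetical, each work-group is taken to cover `8 * lXdim` consecutive elements of a row, and the partial results are assumed to be laid out as `out[row * wgXdim + group]`. It does not reproduce the kernel's float4 loads or its local-memory tree reduction, only the shape of the data that flows between phases.

```c
#include <assert.h>
#include <stddef.h>

/* CPU emulation of one reduction phase (hypothetical helper).
 * Each emulated work-group of lXdim work-items covers 8 * lXdim consecutive
 * floats of a row and writes one partial sum.
 * Assumed output layout: out[row * wgXdim + group].
 * Returns wgXdim, the number of partial sums produced per row. */
size_t block_sums(const float *in, float *out, size_t rows, size_t N, size_t lXdim)
{
    assert(in != NULL && out != NULL && N > 0 && lXdim > 0);

    size_t span   = 8 * lXdim;               /* elements covered by one work-group */
    size_t wgXdim = (N + span - 1) / span;   /* work-groups needed per row */

    for (size_t r = 0; r < rows; ++r) {
        for (size_t g = 0; g < wgXdim; ++g) {
            size_t begin = g * span;
            size_t end   = (begin + span < N) ? begin + span : N;
            float s = 0.f;                   /* identity of the sum */
            for (size_t c = begin; c < end; ++c)
                s += in[r * N + c];
            out[r * wgXdim + g] = s;
        }
    }
    return wgXdim;
}
```

Calling `block_sums` once on the input and then again on its own output, with the returned `wgXdim` as the new row length, mimics a two-phase reduction: the first call yields `wgXdim * rows` partial sums and the second collapses them to one value per row.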