Kernels for performing the Reduce operation.

Functions
kernel void	reduce_min_f (global float4 *in, global float *out, local float *data, uint n)
	Performs a reduce operation on the columns of an array.

kernel void	reduce_max_ui (global uint4 *in, global uint *out, local uint *data, uint n)
	Performs a reduce operation on the columns of an array.

kernel void	reduce_sum_f (global float4 *in, global float *out, local float *data, uint n)
	Performs a reduce operation on the columns of an array.

Detailed Description

Kernels for performing the Reduce operation.

Author: Nick Lamprianidis

Version: 1.0

Date: 2015

Copyright: The MIT License (MIT)

: Copyright (c) 2015 Nick Lamprianidis

: Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Function Documentation

kernel void reduce_max_ui	(	global uint4 *	in,
		global uint *	out,
		local uint *	data,
		uint	n
	)

Performs a reduce operation on the columns of an array.

Computes the maximum element for each row in an array.

Note: When there are multiple rows in the array, a reduce operation is performed per row, in parallel.

The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as uint4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 uint (= 2 uint4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.

When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true maximums. When there are more elements than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the maximum from each block reduction, and a further reduction should then be made on those maximums for the final results. In the case of multiple work-groups, the number of work-groups in the x dimension, \( wgXdim \), should be a multiple of 4. Any extra work-groups are there to enforce correctness: they write the identity operand, 0, to the output array, since in the next phase the data are going to be handled as uint4.
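The workspace constraints in the note can be turned into a small host-side calculation. The following is a minimal sketch in plain C, assuming the host picks the smallest number of work-groups that covers a row; the `reduce_dispatch` type and the `plan_reduce` function are hypothetical names used only for illustration and are not part of the documented kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, for illustration only. */
typedef struct {
    size_t wg_x;      /* work-groups per row (wgXdim) */
    size_t global_x;  /* x dimension of the global workspace (gXdim) */
    size_t global_y;  /* y dimension of the global workspace (gYdim) */
} reduce_dispatch;

/* cols = N (multiple of 4), rows = M, local_x = lXdim (power of 2). */
static reduce_dispatch plan_reduce(size_t cols, size_t rows, size_t local_x)
{
    assert(cols > 0 && cols % 4 == 0);
    assert(local_x > 0 && (local_x & (local_x - 1)) == 0);

    /* Each work-item handles 8 elements, so one work-group covers 8 * lXdim. */
    size_t per_group = 8 * local_x;

    /* Smallest number of work-groups that covers the row (assumption). */
    size_t wg_x = (cols + per_group - 1) / per_group;

    /* With several work-groups per row, round up to a multiple of 4. */
    if (wg_x > 1)
        wg_x = (wg_x + 3) / 4 * 4;

    reduce_dispatch d = { wg_x, wg_x * local_x, rows };
    return d;
}
```

For example, a 3 x 1024 array with a local size of 64 needs two work-groups per row to cover the data, which the sketch rounds up to four, giving a global workspace of 256 x 3.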

Parameters

[in]	in	input array of `uint` elements.
[out]	out	(reduced) output array of `uint` elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows * sizeof(uint) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim * rows * sizeof(uint) \).
[in]	data	local buffer. Its size should be `2 uint` elements for each work-item in a work-group. That is, \( 2 * lXdim * sizeof(uint) \).
[in]	n	number of elements in a row of the array divided by 4.
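The sizes in the parameter list follow directly from the dispatch geometry. Below is a minimal sketch of that arithmetic in plain C. The `reduce_args` type and `size_reduce_args` function are hypothetical, and `uint32_t` stands in for the 32-bit OpenCL `uint`; in real host code the resulting values would be passed on through the usual OpenCL buffer and kernel-argument calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, for illustration only. */
typedef struct {
    uint32_t n;          /* kernel argument n: row length in uint4 units */
    size_t out_bytes;    /* size of the out buffer in bytes */
    size_t local_bytes;  /* size of the local data buffer in bytes */
} reduce_args;

/* cols = N, rows = M, local_x = lXdim, wg_x = wgXdim (1 for a single work-group per row). */
static reduce_args size_reduce_args(size_t cols, size_t rows,
                                    size_t local_x, size_t wg_x)
{
    assert(cols % 4 == 0);
    assert(wg_x >= 1);

    reduce_args a;
    a.n = (uint32_t)(cols / 4);                      /* elements per row divided by 4 */
    a.out_bytes = wg_x * rows * sizeof(uint32_t);    /* one result per work-group per row */
    a.local_bytes = 2 * local_x * sizeof(uint32_t);  /* two elements per work-item */
    return a;
}
```

With `wg_x` set to 1 the output size reduces to one `uint` per row, which matches the single work-group case described above.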

kernel void reduce_min_f	(	global float4 *	in,
		global float *	out,
		local float *	data,
		uint	n
	)

Performs a reduce operation on the columns of an array.

Computes the minimum element for each row in an array.

Note: When there are multiple rows in the array, a reduce operation is performed per row, in parallel.

The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.

When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true minimums. When there are more elements than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the minimum from each block reduction, and a further reduction should then be made on those minimums for the final results. In the case of multiple work-groups, the number of work-groups in the x dimension, \( wgXdim \), should be a multiple of 4. Any extra work-groups are there to enforce correctness: they write the identity operand, INFINITY, to the output array, since in the next phase the data are going to be handled as float4.
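The role of the INFINITY padding in the two-phase scheme can be illustrated on the CPU. The sketch below emulates both phases for a single row in plain C; it is not the kernel's implementation, and the names `block_min`, `phase1_min`, `phase2_min` and `per_group` are hypothetical. It assumes that each work-group reduces one contiguous block of `per_group` elements and that a work-group with no data writes the identity.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Minimum of row[begin..end); an empty range yields the identity, INFINITY. */
static float block_min(const float *row, size_t begin, size_t end)
{
    float m = INFINITY;
    for (size_t i = begin; i < end; ++i)
        m = row[i] < m ? row[i] : m;
    return m;
}

/* Phase 1: one partial minimum per work-group, written to partial[0..wg_x). */
static void phase1_min(const float *row, size_t cols, size_t per_group,
                       size_t wg_x, float *partial)
{
    for (size_t g = 0; g < wg_x; ++g) {
        size_t begin = g * per_group;
        size_t end = begin + per_group;
        if (begin > cols) begin = cols;  /* extra work-group: empty block */
        if (end > cols) end = cols;
        partial[g] = block_min(row, begin, end);
    }
}

/* Phase 2: reduce the partial results, which are read in groups of 4. */
static float phase2_min(const float *partial, size_t wg_x)
{
    assert(wg_x % 4 == 0);
    return block_min(partial, 0, wg_x);
}
```

Because the padded entries hold INFINITY, they can be read as part of a full float4 in the second phase without affecting the final minimum.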

Parameters

[in]	in	input array of `float` elements.
[out]	out	(reduced) output array of `float` elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows * sizeof(float) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim * rows * sizeof(float) \).
[in]	data	local buffer. Its size should be `2 float` elements for each work-item in a work-group. That is, \( 2 * lXdim * sizeof(float) \).
[in]	n	number of elements in a row of the array divided by 4.

kernel void reduce_sum_f	(	global float4 *	in,
		global float *	out,
		local float *	data,
		uint	n
	)

Performs a reduce operation on the columns of an array.

Computes the sum of the elements of each row in an array.

Note: When there are multiple rows in the array, a reduce operation is performed per row, in parallel.

The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.

When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true sums. When there are more elements than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the sum from each block reduction, and a further reduction should then be made on those sums for the final results. In the case of multiple work-groups, the number of work-groups in the x dimension, \( wgXdim \), should be a multiple of 4. Any extra work-groups are there to enforce correctness: they write the identity operand, 0.f, to the output array, since in the next phase the data are going to be handled as float4.
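The power-of-2 requirement on the local workspace and the two local elements per work-item are typical of a tree reduction in local memory. The following sequential C sketch shows how such a scheme could work for one work-group. It is an assumption about the general technique, not the actual kernel source: the access pattern (each work-item loading one float4 from each half of the block) and the names `tree_sum` and `group_sum` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sequentially emulate a tree sum over `count` local values (power of 2). */
static float tree_sum(float *data, size_t count)
{
    assert(count > 0 && (count & (count - 1)) == 0);
    for (size_t stride = count / 2; stride > 0; stride /= 2) {
        /* On a device, each i would be a work-item, with a barrier per stride. */
        for (size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];
    }
    return data[0];
}

/* Emulate one work-group of local_x work-items summing 8 * local_x floats.
 * `data` plays the role of the local buffer and holds 2 * local_x floats. */
static float group_sum(const float *in, size_t local_x, float *data)
{
    for (size_t lid = 0; lid < local_x; ++lid) {
        const float *a = in + 4 * lid;              /* first float4 (assumed layout) */
        const float *b = in + 4 * (lid + local_x);  /* second float4 (assumed layout) */
        data[lid] = a[0] + a[1] + a[2] + a[3];
        data[lid + local_x] = b[0] + b[1] + b[2] + b[3];
    }
    return tree_sum(data, 2 * local_x);
}
```

Halving the stride at every step only pairs up all elements when the element count is a power of 2, which is one reason such kernels place that restriction on the local size.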

Parameters

[in]	in	input array of `float` elements.
[out]	out	(reduced) output array of `float` elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows * sizeof(float) \). When the kernel is dispatched with more than one work-group per row, the array contains the results from each block reduction, and its size should be \( wgXdim * rows * sizeof(float) \).
[in]	data	local buffer. Its size should be `2 float` elements for each work-item in a work-group. That is, \( 2 * lXdim * sizeof(float) \).
[in]	n	number of elements in a row of the array divided by 4.
