ICP  1.1.0
 Hosted by GitHub
Classes | Functions
icp_kernels.cl File Reference

Kernels for the ICP pipeline. More...

Classes

struct  dist_id
 Struct holding a value and a key. More...
 

Functions

kernel void getLMs (global float4 *in, global float4 *out)
 Samples a point cloud for landmarks. More...
 
kernel void getReps (global float8 *in, global float8 *out)
 Samples a set of landmarks for representatives. More...
 
kernel void icpComputeReduceWeights (global dist_id *in, global float *weights, global double *sums, local float *data, uint n)
 Computes a set of weights \( \{w_i = \frac{100}{100+\|x_i-x'_i\|_p}\} \), and reduces them to get their sum, \( \sum^n_i{w_i} \). More...
 
kernel void icpComputeReduceWeights_WG (global dist_id *in, global float *weights, global float *sums, local float *data, uint n)
 Computes a set of weights \( \{w_i = \frac{100}{100+\|x_i-x'_i\|_p}\} \), and reduces them to get their sum, \( \sum^n_i{w_i} \). More...
 
kernel void reduce_sum_fd (global float4 *in, global double *out, local double *data, uint n)
 Performs a reduce operation on the columns of an array. More...
 
kernel void icpMean (global float4 *F, global float4 *M, global float4 *mean, local float *data, uint n)
 Performs reduce operations on arrays of 8-D points. More...
 
kernel void icpMean_Weighted (global float4 *F, global float4 *M, global float4 *MEAN, global float *W, constant double *sum_w, local float *data, uint n)
 Performs reduce operations on arrays of 8-D points. More...
 
kernel void icpGMean (global float4 *in, global float4 *out, local float *data, uint n)
 Performs a reduce operation on an array of 4-D points. More...
 
kernel void icpSubtractMean (global float4 *F, global float4 *M, global float4 *DF, global float4 *DM, constant float4 *mean)
 Computes the deviations from the means of the fixed and moving sets of 8-D points. More...
 
kernel void icpSijProducts (global float4 *M, global float4 *F, global float *Sij, uint m, float c)
 Produces the products in the Sij elements of the S matrix. More...
 
kernel void icpSijProducts_Weighted (global float4 *M, global float4 *F, global float *W, global float *Sij, uint m, float c)
 Produces the weighted products in the Sij elements of the S matrix. More...
 
kernel void icpTransform_Quaternion (global float4 *M, global float4 *tM, constant float4 *data)
 Performs a homogeneous transformation on a set of points, \( p = \left[ \begin{matrix} p_x & p_y & p_z & 1 \end{matrix} \right]^T \) (or as a quaternion, \( \dot{p} = \left[ \begin{matrix} p_x & p_y & p_z & 0 \end{matrix} \right]^T \)), using unit quaternions \( \dot{q} = (\omega, \mathcal{v}) = q_w + (q_x i + q_y j + q_z k) = \left[ \begin{matrix} q_x & q_y & q_z & q_w \end{matrix} \right]^T \), where \( \dot{q}\cdot\dot{q}=1 \). More...
 
kernel void icpTransform_Quaternion_2 (global float4 *M, global float4 *tM, constant float4 *data)
 Performs a homogeneous transformation on a set of points, \( p = \left[ \begin{matrix} p_x & p_y & p_z & 1 \end{matrix} \right]^T \) (or as a quaternion, \( \dot{p} = \left[ \begin{matrix} p_x & p_y & p_z & 0 \end{matrix} \right]^T \)), using unit quaternions \( \dot{q} = q_w + q_x i + q_y j + q_z k = \left[ \begin{matrix} q_x & q_y & q_z & q_w \end{matrix} \right]^T \), where \( \dot{q}\cdot\dot{q}=1 \). More...
 
kernel void icpTransform_Matrix (global float4 *M, global float4 *tM, constant float4 *T)
 Performs a homogeneous transformation on a set of points. More...
 
void prod (float4 *N, float4 *x, float4 *x_new)
 Computes a matrix-vector product, \( x_{new}=Nx \). More...
 
kernel void icpPowerMethod (global float *Sij, global float4 *means, global float4 *Tk)
 Computes the quantities that represent the incremental development in the transformation estimation in iteration k. More...
 

Detailed Description

Kernels for the ICP pipeline.

Author
Nick Lamprianidis
Version
1.0
Date
2015
Copyright (c) 2015 Nick Lamprianidis
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Function Documentation

kernel void getLMs ( global float4 *  in,
global float4 *  out 
)

Samples a point cloud for landmarks.

Chooses landmarks at specific intervals in the x and y dimension.

Note
16384 landmarks are extracted from the point cloud \( (640 \times 480) \). These landmarks come from a center area. 10% of the points around the cloud are ignored.
From the center area \( (512 \times 384) \), points are sampled 1:4 in the x dimension and 1:3 in the y dimension. There is also an offset 1 in the x dimension and 1 in the y dimension. This creates an array \( (128 \times 128) \) of landmarks.
Invalid points (zero coordinates) are going to be picked. Further processing is needed for those points to be discraded, if necessary.
The x dimension of the global workspace, \( gXdim \), should be equal to two times the number of landmarks per row (2 work-items per landmark). That is, \( \ gXdim=256 \). The y dimension of the global workspace, \( gYdim \), should be equal to the number of landmarks per column. That is, \( \ gYdim=128 \). There is no requirement for the local workspace.
Parameters
[in]inarray (point cloud) of float8 elements.
[out]outarray (landmarks) of float8 elements.
kernel void getReps ( global float8 *  in,
global float8 *  out 
)

Samples a set of landmarks for representatives.

Chooses representatives at specific intervals in the x and y dimension.

Note
Representatives are extracted from the set of landmarks \( (128 \times 128) \).
Representatives are sampled \( 1:(128/n_{rx}) \) in the x dimension and \( 1:(128/n_{ry}) \) in the y dimension. There is also an offset \( (128/n_{rx})/2-1 \) in the x dimension and \( (128/n_{ry})/2-1 \) in the y dimension. This creates an array \( (n_{rx} \times n_{ry}) \) of representatives.
The x dimension of the global workspace, \( gXdim \), should be equal to the number of representatives per row. That is, \( \ gXdim=n_{rx }\). The y dimension of the global workspace, \( gYdim \), should be equal to the number of representatives per column. That is, \( \ gYdim=n_{ry} \). There is no requirement for the local workspace.
Parameters
[in]inarray (landmarks) of float8 elements.
[out]outarray (representatives) of float8 elements.
kernel void icpComputeReduceWeights ( global dist_id in,
global float *  weights,
global double *  sums,
local float *  data,
uint  n 
)

Computes a set of weights \( \{w_i = \frac{100}{100+\|x_i-x'_i\|_p}\} \), and reduces them to get their sum, \( \sum^n_i{w_i} \).

Takes distances between pairs of points and produces a set of weights.

Note
The number of elements in the input array, n, should be a multiple of 2. The global workspace should be one dimensional, and its x dimension, \( gXdim \), should be greater than or equal to the number of elements in the array, n, divided by 2. That is, \( \ gXdim \geq n/2 \). Each work-item handles 2 dist_id elements. The local workspace should be one dimensional, and its x dimension should be a power of 2. It is recommended to use one wavefront/warp per work-group.
The kernel is aimed to be used when the number of elements in the array is small enough to be handled by a single work-group. It promotes the output data type to double so that a consistent API is maintained.
Parameters
[in]inarray of dist_id elements.
[out]weightsarray with the computed weights (float elements).
[out]sums(reduced) array with the sum. Its size should be \( sizeof\ (double) \).
[in]datalocal buffer. Its size should be 2 float elements for each work-item in a work-group. That is \( 2*lXdim*sizeof\ (float) \).
[in]nnumber of elements in the array.
kernel void icpComputeReduceWeights_WG ( global dist_id in,
global float *  weights,
global float *  sums,
local float *  data,
uint  n 
)

Computes a set of weights \( \{w_i = \frac{100}{100+\|x_i-x'_i\|_p}\} \), and reduces them to get their sum, \( \sum^n_i{w_i} \).

Takes distances between pairs of points and produces a set of weights.

Note
The number of elements in the input array, n, should be a multiple of 2. The global workspace should be one dimensional, and its x dimension, \( gXdim \), should be greater than or equal to the number of elements in the array, n, divided by 2. That is, \( \ gXdim \geq n/2 \). Each work-item handles 2 dist_id elements. The local workspace should be one dimensional, and its x dimension should be a power of 2. It is recommended to use one wavefront/warp per work-group.
When the number of elements is small enough to be handled by a single work-group, use icpComputeReduceWeights instead.
When the elements are more than what a single work-group can handle, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the sum from each block reduction. A reduction should then be made (by reduce_sum_fd) on those sums for the final result. The number of work-groups in the x dimension, \( wgXdim \), for the case of multiple work-groups, should be made a multiple of 4. The potential extra work-groups are used for enforcing correctness. They write the necessary identity operands, 0.f, in the output array, since in the next phase the data are going to be handled as float4.
Parameters
[in]inarray of dist_id elements.
[out]weightsarray with the computed weights (float elements).
[out]sums(reduced) array with the sums. Its size should be \( wgXdim*sizeof\ (float) \).
[in]datalocal buffer. Its size should be 2 float elements for each work-item in a work-group. That is \( 2*lXdim*sizeof\ (float) \).
[in]nnumber of elements in the array.
kernel void icpGMean ( global float4 *  in,
global float4 *  out,
local float *  data,
uint  n 
)

Performs a reduce operation on an array of 4-D points.

Computes the sums \( \bar{x}_j = \sum^{n}_{i}{x_{ij}}, j=\{0,1,2\} \) on the xyz dimensions of the points in the fixed and moving sets.

Note
This kernel is supposed to be used for the reduction of the block means from icpMean.
The global workspace should be one dimensional and its x dimension, \( gXdim \), should be greater or equal to the number of points, n, in the array divided by 2. That is, \( \ gXdim \geq n/2 \). The local workspace should also be one dimensional, and its x dimension should be a power of 2. It is recommended to use one wavefront/warp per work-group.
When the number of points in the array is small enough to be handled by a single work-group, the output array will contain the true means. When the points are more than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the means from each block reduction. A reduction should then be made on those means for the final result.
Parameters
[in]inarray of float4 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[out]outarray of mean float4 vectors. When the kernel is dispatched with just one work-group, the array contains one vector with the means on the xyz dimensions, and its size should be \( sizeof\ (float4) \). When the kernel is dispatched with more than one work-group, the array contains the means from each block reduction, and its size should be \( wgXdim*sizeof\ (float4) \).
[in]datalocal buffer. Its size should be 6 float elements for each work-item in a work-group. That is \( lXdim*(2*(3*sizeof\ (float))) \).
[in]nnumber of points in the array.
kernel void icpMean ( global float4 *  F,
global float4 *  M,
global float4 *  mean,
local float *  data,
uint  n 
)

Performs reduce operations on arrays of 8-D points.

Computes the means \( \bar{x}_j = \sum^{n}_{i}{\frac{x_{ij}}{n}}, j=\{0,1,2\} \) on the xyz dimensions of the points in the fixed and moving sets.

Note
The kernel normalizes the data, and then performs the reduction.
The number of elements, n, in the arrays should be a multiple of 2 (each work-item loads 2 points). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of points in the arrays divided by 2. That is, \( \ gXdim \geq n/2 \). The y dimension of the global workspace, \( gYdim \), should be equal to 2. That is, \( \ gYdim = 2 \). The local workspace should also be one dimensional, and its x dimension should be a power of 2. It is recommended to use one wavefront/warp per work-group.
When the number of points in the arrays is small enough to be handled by a single work-group, the output arrays will contain the true means. When the points are more than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the means from each block reduction. A reduction should then be made on those means for the final result.
Parameters
[in]Ffixed set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Mmoving set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[out]meanarray of mean float4 vectors. When the kernel is dispatched with just one work-group per row, the array contains one vector per set with the means on the xyz dimensions. Its size should be \( 2*sizeof\ (float4) \). The first float4 is the mean for the fixed set, and the second float4 is the mean for the moving set. When the kernel is dispatched with more than one work-group, the array contains the means from each block reduction, and its size should be \( 2*(wgXdim*sizeof\ (float4)) \). The first row contains the block means for the fixed set, and the second row contains the block means for the moving set.
[in]datalocal buffer. Its size should be 6 float elements for each work-item in a work-group. That is \( lXdim*(2*(3*sizeof\ (float))) \).
[in]nnumber of points in the sets.
kernel void icpMean_Weighted ( global float4 *  F,
global float4 *  M,
global float4 *  MEAN,
global float *  W,
constant double *  sum_w,
local float *  data,
uint  n 
)

Performs reduce operations on arrays of 8-D points.

Computes the weighted means \( \bar{x}_j = \frac{\sum^{n}_{i}{w_i*x_{ij}}} {\sum^{n}_{i}{w_i}}, j=\{0,1,2\} \) on the xyz dimensions of the points in the fixed and moving sets.

Note
The kernel normalizes the data, and then performs the reduction.
The number of elements, n, in the arrays should be a multiple of 2 (each work-item loads 2 points). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of points in the arrays divided by 2. That is, \( \ gXdim \geq n/2 \). The y dimension of the global workspace, \( gYdim \), should be equal to 2. That is, \( \ gYdim = 2 \). The local workspace should also be one dimensional, and its x dimension should be a power of 2. It is recommended to use one wavefront/warp per work-group.
When the number of points in the arrays is small enough to be handled by a single work-group, the output arrays will contain the true means. When the points are more than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the means from each block reduction. A reduction should then be made on those means for the final result.
Parameters
[in]Ffixed set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Mmoving set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[out]MEANarray of mean float4 vectors. When the kernel is dispatched with just one work-group per row, the array contains one vector per set with the means on the xyz dimensions. Its size should be \( 2*sizeof\ (float4) \). The first float4 is the mean for the fixed set, and the second float4 is the mean for the moving set. When the kernel is dispatched with more than one work-group, the array contains the means from each block reduction, and its size should be \( 2*(wgXdim*sizeof\ (float4)) \). The first row contains the block means for the fixed set, and the second row contains the block means for the moving set.
[in]Warray with the weights between the pairs of points.
[in]sum_wsum of the weights.
[in]datalocal buffer. Its size should be 6 float elements for each work-item in a work-group. That is \( lXdim*(2*(3*sizeof\ (float))) \).
[in]nnumber of points in the sets.
kernel void icpPowerMethod ( global float *  Sij,
global float4 *  means,
global float4 *  Tk 
)

Computes the quantities that represent the incremental development in the transformation estimation in iteration k.

Uses the Power Method to estimate the unit quaternion \( q_k \) that represents the rotation, and then computes the scale \( s_k \) and translation \( t_k \).

Note
The Power Method is performed on \(N\) and computes the eigenvector corresponding to the maximum magnitude eigenvalue \( \mu \). The eigenvector of interest is the one corresponding to the most positive eigenvalue \(\lambda\). If \( \mu<0 \), the algorithm is executed again on \( N'=N + |\lambda| I \). Then, the eigenvalue \( \lambda \) is \( \mu' - \mu \). The corresponding eigenvector doesn't change.
The kernel should be dispatched as a task (1 work-item).
Parameters
[in]Sijarray (sums of products) of size \(11*sizeof\ (float)\). The first 9 elements (in row major order) are the \(S_k\) matrix, and the next 2 are the numerator and denominator of the scale \(s_k\).
[in]meansarray (fixed and moving set means) of size \(2*sizeof\ (float4)\).
[out]Tkarray of size \( 2 * sizeof\ (float4) \). The first float4 is the unit quaternion \( \dot{q_k} = q_w + q_x i + q_y j + q_z k = \left[ \begin{matrix} q_x & q_y & q_z & q_w \end{matrix} \right]^T \), and the second one is the translation vector \( t_k=\left[ \begin{matrix} t_x & t_y & t_z & 1 \end{matrix} \right]^T \). The scale is placed in the last element of the translation vector. That is, \( t_k = \left[ \begin{matrix} t_x & t_y & t_z & s_k \end{matrix} \right]^T \).
kernel void icpSijProducts ( global float4 *  M,
global float4 *  F,
global float *  Sij,
uint  m,
float  c 
)

Produces the products in the Sij elements of the S matrix.

Multiplies the deviations of corresponding points in the fixed set \( \mathcal{F}_{m \times 3} \) and the moving set \( \mathcal{M}_{m \times 3} \).

Note
Since large sums are built, there is an option to scale the points in order to deal with floating point arithmetic issues. The resulting S matrix will have accordingly scaled eigenvalues. The eigenvectors stay the same. The constituents of the scaling factor s are scaled as well, but the factor itself is not affected.
Each work-item processes up to 4 pairs of points. So, the output of each work-item is partial sums of products. On a next step, these results will have to be reduced to get the final S matrix.
The global workspace should be one dimensional and its x dimension, \( gXdim \), should be greater or equal to the number of points, m, in the sets. That is, \( \ gXdim \geq m \). There is no requirement for the local workspace.
Parameters
[in]Marray (moving set deviations) of float4 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Farray (fixed set deviations) of float4 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[out]Sijarray (partial sums of products). Its number of rows is 11. Its number of columns is \( gXdim \). So, its size should be \( 11 * gXdim * sizeof\ (float) \).
[in]mnumber of points in the sets.
[in]cscaling factor.
kernel void icpSijProducts_Weighted ( global float4 *  M,
global float4 *  F,
global float *  W,
global float *  Sij,
uint  m,
float  c 
)

Produces the weighted products in the Sij elements of the S matrix.

Multiplies the deviations of corresponding points in the fixed set \( \mathcal{F}_{m \times 3} \) and the moving set \( \mathcal{M}_{m \times 3} \).

Note
Since large sums are built, there is an option to scale the points in order to deal with floating point arithmetic issues. The resulting S matrix will have accordingly scaled eigenvalues. The eigenvectors stay the same. The constituents of the scaling factor s are scaled as well, but the factor itself is not affected.
Each work-item processes up to 4 pairs of points. So, the output of each work-item is partial sums of products. On a next step, these results will have to be reduced to get the final S matrix.
The global workspace should be one dimensional and its x dimension, \( gXdim \), should be greater or equal to the number of points, m, in the sets. That is, \( \ gXdim \geq m \). There is no requirement for the local workspace.
Parameters
[in]Marray (moving set deviations) of float4 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Farray (fixed set deviations) of float4 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Warray (weights) of float elements.
[out]Sijarray (partial sums of products). Its number of rows is 11. Its number of columns is \( gXdim \). So, its size should be \( 11 * gXdim * sizeof\ (float) \).
[in]mnumber of points in the sets.
[in]cscaling factor.
kernel void icpSubtractMean ( global float4 *  F,
global float4 *  M,
global float4 *  DF,
global float4 *  DM,
constant float4 *  mean 
)

Computes the deviations from the means of the fixed and moving sets of 8-D points.

Subtracts the means from the 4-D geometric coordinates of the points.

Note
The x dimension of the global workspace, \( gXdim \), should be equal to the number of points in the sets. That is, \( \ gXdim=n \). The y dimension of the global workspace should be equal to 2. That is, \( \ gYdim=2 \). There is no requirement for the local workspace.
Parameters
[in]Ffixed set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[in]Mmoving set of float8 elements. The first 3 dimensions should contain the xyz coordinates of the points.
[out]DFarray of float4 elements (fixed set deviations from the mean). Only the geometric information gets transfered in the output.
[out]DMarray of float4 elements (moving set deviations from the mean). Only the geometric information gets transfered in the output.
[in]meanfixed and moving set means. The first float4 is the fixed set mean, amd the second float is the moving set mean.
kernel void icpTransform_Matrix ( global float4 *  M,
global float4 *  tM,
constant float4 *  T 
)

Performs a homogeneous transformation on a set of points.

Transforms each point in a set, \( p' = Tp = \left[ \begin{matrix} R' & t \\ 0 & 1 \end{matrix} \right]\left[ \begin{matrix} p \\ 1 \end{matrix} \right] = \left[ \begin{matrix} sR & t \\ 0 & 1 \end{matrix} \right]\left[ \begin{matrix} p \\ 1 \end{matrix} \right] = sRp+t \).

Note
It there is a need for scaling, the scaling factor should already be incorporated in the rotation matrix.
The x dimension of the global workspace, \( gXdim \), should be equal to 2. That is, \( \ gXdim = 2 \). The y dimension of the global workspace should be equal to the number of points, m, in the sets. That is, \( \ gYdim = m \). There is no requirement for the local workspace.
Parameters
[in]Marray of float8 elements. The first 4 dimensions should contain the homogeneous coordinates of the points.
[out]tMarray of float8 elements. The first 4 dimensions will contain the transformed homogeneous coordinates of the points.
[in]Tthe transformation matrix of size \( 16 * sizeof\ (float) \). The elements should be laid out in row major order.
kernel void icpTransform_Quaternion ( global float4 *  M,
global float4 *  tM,
constant float4 *  data 
)

Performs a homogeneous transformation on a set of points, \( p = \left[ \begin{matrix} p_x & p_y & p_z & 1 \end{matrix} \right]^T \) (or as a quaternion, \( \dot{p} = \left[ \begin{matrix} p_x & p_y & p_z & 0 \end{matrix} \right]^T \)), using unit quaternions \( \dot{q} = (\omega, \mathcal{v}) = q_w + (q_x i + q_y j + q_z k) = \left[ \begin{matrix} q_x & q_y & q_z & q_w \end{matrix} \right]^T \), where \( \dot{q}\cdot\dot{q}=1 \).

Transforms each point in a set, \(\ p' = s\dot{q}\dot{p}\dot{q}^*+t = s(p + 2\mathcal{v} \times (\mathcal{v} \times p + \omega p)) + t \).

Note
The x dimension of the global workspace, \( gXdim \), should be equal to 2. That is, \( \ gXdim = 2 \). The y dimension of the global workspace should be equal to the number of points, m, in the sets. That is, \( \ gYdim = m \). There is no requirement for the local workspace.
Parameters
[in]Marray of float8 elements. The first 4 dimensions should contain the homogeneous coordinates of the points.
[out]tMarray of float8 elements. The first 4 dimensions will contain the transformed homogeneous coordinates of the points.
[in]dataarray of size \( 2 * sizeof\ (float4) \). The first float4 is the quaternion, and the second is the translation vector. If there is a need to apply scaling, the factor should be available in the last element of the translation vector. That is, \( t = \left[ \begin{matrix} t_x & t_y & t_z & s \end{matrix} \right]^T \).
kernel void icpTransform_Quaternion_2 ( global float4 *  M,
global float4 *  tM,
constant float4 *  data 
)

Performs a homogeneous transformation on a set of points, \( p = \left[ \begin{matrix} p_x & p_y & p_z & 1 \end{matrix} \right]^T \) (or as a quaternion, \( \dot{p} = \left[ \begin{matrix} p_x & p_y & p_z & 0 \end{matrix} \right]^T \)), using unit quaternions \( \dot{q} = q_w + q_x i + q_y j + q_z k = \left[ \begin{matrix} q_x & q_y & q_z & q_w \end{matrix} \right]^T \), where \( \dot{q}\cdot\dot{q}=1 \).

Transforms each point in a set, \(\ p' = s\dot{q}\dot{p}\dot{q}^*+t = s\bar{Q}^TQ\dot{p}+t = s \left[ \begin{matrix} q_w & q_z & -q_y & q_x \\ -q_z & q_w & q_x & q_y \\ q_y & -q_x & q_w & q_z \\ -q_x & -q_y & -q_z & q_w \end{matrix} \right]^T \left[ \begin{matrix} q_w & -q_z & q_y & q_x \\ q_z & q_w & -q_x & q_y \\ -q_y & q_x & q_w & q_z \\ -q_x & -q_y & -q_z & q_w \end{matrix} \right] \left[ \begin{matrix} p_x \\ p_y \\ p_z \\ 0 \end{matrix} \right] + \left[ \begin{matrix} t_x \\ t_y \\ t_z \\ 1 \end{matrix} \right] = s \left[ \begin{matrix} 1-2q_y^2-2q_z^2 & 2(q_xq_y-q_zq_w) & 2(q_xq_z+q_yq_w) & 0 \\ 2(q_xq_y+q_zq_w) & 1-2q_x^2-2q_z^2 & 2(q_yq_z-q_xq_w) & 0 \\ 2(q_xq_z-q_yq_w) & 2(q_yq_z+q_xq_w) & 1-2q_x^2-2q_y^2 & 0 \\ 0 & 0 & 0 & 1 \end{matrix} \right] \left[ \begin{matrix} p_x \\ p_y \\ p_z \\ 0 \end{matrix} \right] + \left[ \begin{matrix} t_x \\ t_y \\ t_z \\ 1 \end{matrix} \right]\).

Note
The x dimension of the global workspace, \( gXdim \), should be equal to 2. That is, \( \ gXdim = 2 \). The y dimension of the global workspace should be equal to the number of points, m, in the sets. That is, \( \ gYdim = m \). There is no requirement for the local workspace.
Parameters
[in]Marray of float8 elements. The first 4 dimensions should contain the homogeneous coordinates of the points.
[out]tMarray of float8 elements. The first 4 dimensions will contain the transformed homogeneous coordinates of the points.
[in]dataarray of size \( 2 * sizeof\ (float4) \). The first float4 is the quaternion, and the second is the translation vector. If there is a need to apply scaling, the factor should be available in the last element of the translation vector. That is, \( t = \left[ \begin{matrix} t_x & t_y & t_z & s \end{matrix} \right]^T \).
void prod ( float4 *  N,
float4 *  x,
float4 *  x_new 
)
inline

Computes a matrix-vector product, \( x_{new}=Nx \).

Parameters
[in]N4x4 matrix (4xfloat4 elements).
[in]xvector (float4 element).
[out]x_newvector (float4 element).
kernel void reduce_sum_fd ( global float4 *  in,
global double *  out,
local double *  data,
uint  n 
)

Performs a reduce operation on the columns of an array.

Computes the sum of the elements of each row in an array.

Note
When there are multiple rows in the array, a reduce operation is performed per row, in parallel.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4). The x dimension of the global workspace, \( gXdim \), should be greater or equal to the number of elements in a row of the array divided by 8. That is, \( \ gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( \ gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true sums. When the elements are more than that, they are partitioned into blocks and reduced independently. In this case, the kernel outputs the sum from each block reduction. A reduction should then be made on those sums for the final results. The number of work-groups in the x dimension, \( wgXdim \), for the case of multiple work-groups, should be made a multiple of 4. The potential extra work-groups are used for enforcing correctness. They write the necessary identity operands, 0.0, in the output array, since in the next phase the data are going to be handled as double4.
Parameters
[in]inarray of float elements.
[out]out(reduced) array of double elements. When the kernel is dispatched with one work-group per row, the array contains the final results, and its size should be \( rows*sizeof\ (double) \). When the kernel is dispatched with more than one work-groups per row, the array contains the results from each block reduction, and its size should be \( wgXdim*rows*sizeof\ (double) \).
[in]datalocal buffer. Its size should be 2 double elements for each work-item in a work-group. That is \( 2*lXdim*sizeof\ (double) \).
[in]nnumber of elements in a row of the array divided by 4.