C# Advanced GPU Programming

This section discusses the following advanced C# GPU programming features:

Controlling Automatic Memory Management


The Alea GPU automatic memory management system handles memory allocation and data movement. An introduction is given in the GPU Programming chapter. Here we provide additional details on how you can further control automatic memory management with nesting and scoping to reduce the potential performance impact of code instrumentation.

Automatic memory management is turned on as follows:

  1. Reference the Alea.Fody assembly, which adds a post-compilation step that modifies the assembly by inserting memory synchronization instructions.
  2. Add the attribute [GpuManaged] to the block of code that should be automatically managed.

The attribute [GpuManaged] defines the scope for automatic memory management. Inside a block of code attributed with [GpuManaged], all array arguments of a kernel are copied to the GPU if necessary, that is, if the arrays are outdated on the device or have never been copied to the device. Similarly, all results are copied back to the host if necessary, which means they are used again on the host or are out of date on the host. If another method is called that is not instrumented, then all monitored arrays that are out of date on the host are copied back to the host before the method call.

The attribute [GpuManaged] can be nested to increase the scope. The Alea GPU functions Gpu.Launch and Gpu.For also have the attribute [GpuManaged]. Hence adding [GpuManaged] to a block of code that calls a kernel or Gpu.For is just nesting.
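
As a minimal sketch, assuming a GPU instance and an illustrative helper method that is not taken from the samples, a block of code can be instrumented like this:

[GpuManaged]
public static void AddOne(Gpu gpu, int[] data)
{
    // Inside this [GpuManaged] scope 'data' is copied to the GPU before the
    // parallel for runs (if it is outdated on the device) and copied back to
    // the host only when it is accessed there again.
    gpu.For(0, data.Length, i => data[i] += 1);
}

Since Gpu.For itself carries [GpuManaged], the attribute on AddOne simply extends the managed scope to the caller.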

Note that it is not possible to extend automatic memory management to third-party libraries. In this case the automatic memory management system detects that the array arguments passed in from third-party callers are all unmanaged. They are therefore copied to the GPU right before a kernel call and copied back to the host after the kernel call, because we have to assume they are used again on the host.

The Alea GPU automatic memory management system is carefully optimized to limit the runtime penalty in the host code. However, to avoid unnecessary memory transfers and performance penalties, you should be aware of how automatic memory management works. For example, you should try to minimize array accesses from the host side if the arrays are later used again in GPU kernels. If the runtime penalty is not acceptable you can always switch to explicit memory management or use device memory directly.

The Alea GPU automatic memory management does not rely on the CUDA unified memory technology. Alea GPU automatic memory management migrates the complete array, whereas unified memory migrates individual memory pages. Alea GPU also supports unified memory. It is discussed in the section on unified memory.

Explicit Memory Management


For performance critical or complex applications the automatic memory management may not be efficient enough. In this case you can resort to explicit memory management to control allocation and data movement.

Explicitly Managing GPU Device Memory with .NET Arrays

Alea GPU provides generic allocation and copy functions to work with .NET arrays in GPU kernels. The member function

T[] Gpu.Allocate<T>(int length)  

allocates an array of the requested length on the given GPU, whereas

T[] Gpu.Allocate<T>(T[] array)

allocates an array on a GPU and copies the content of an existing .NET array to the GPU. The corresponding functions for 2D arrays work similarly.

Note:

The array returned by Gpu.Allocate is just a handle to an array on a GPU and has length 0.

Explicitly allocated arrays can be freed again with Gpu.Free(array). Explicitly allocated GPU arrays are eventually deallocated by the garbage collector once they are no longer used. However, this might not happen early enough for new kernel launches that need the memory.

Note:

Explicitly freeing explicitly allocated GPU arrays helps to reduce GPU memory pressure.

The static member function

T[] Gpu.CopyToHost<T>(T[] src)

copies an array on a GPU back to the CPU, creating a new array instance. It is not necessary to specify the GPU on which the array src resides because the underlying memory management keeps track of it. Here is an example.

var array = Gpu.Default.Allocate<long>(Length);

Assert.AreEqual(0, array.Length);

Note that array.Length returns 0 because in CPU code an explicitly allocated GPU array is represented as a fake array of length 0. For this reason we provide additional static member functions to get the length and shape of explicitly allocated GPU arrays:

Assert.AreEqual(Length, Gpu.GetLength(array));

On the GPU, however, we can use the array member functions as if the array were an implicitly managed .NET array:

var sizeInfo = new long[1];
gpu.Launch(() => sizeInfo[0] = array.Length, new LaunchParam(1, 1));

var results = Gpu.CopyToHost(array);

Assert.AreEqual(Length, sizeInfo[0]);

Gpu.Free(array);

At the end we free the array explicitly to dispose the allocated GPU memory resources immediately. The example is in the sample gallery. It also shows how to allocate 2D arrays.

To copy between previously allocated arrays you can use the static member function

void Gpu.Copy<T>(T[] src, T[] dst)

Again there is no need to specify the GPU. The number of elements copied is equal to min(src.Length, dst.Length). If both arrays live on the CPU, Gpu.Copy throws an error. The following example uses Gpu.Copy. The arrays arg1 and arg2 live on the CPU. They are copied to the GPU gpu once with Gpu.Allocate and once with Gpu.Copy. The GPU array dResult is copied into a preallocated .NET array.

var dArg1 = gpu.Allocate(arg1);
var dArg2 = gpu.Allocate<int>(Length);
var dResult = gpu.Allocate<int>(Length);
Gpu.Copy(arg2, dArg2);

gpu.Launch(Kernel, lp, dResult, dArg1, dArg2);

var result = new int[Length];
Gpu.Copy(dResult, result);

Gpu.Free(dArg1);
Gpu.Free(dArg2);
Gpu.Free(dResult);

The full code is in the sample gallery.

Directly Working with Device Memory

Device memory provides even more flexibility as it also allows all kinds of pointer arithmetic. Device memory is allocated with

Memory<T> Gpu.AllocateDevice<T>(int length)
Memory<T> Gpu.AllocateDevice<T>(T[] array)

The first overload creates a device memory object for the specified type T and length on the selected GPU. The second one allocates storage on the GPU and copies the .NET array into it. Both return a Memory<T> object, which implements IDisposable and therefore supports the using syntax, ensuring proper disposal once the Memory<T> object goes out of scope. A Memory<T> object has properties to determine its length and the GPU device on which it lives. The Memory<T>.Ptr property returns a deviceptr<T>, which can be used in GPU code to access the actual data or to perform pointer arithmetic. The following example illustrates a simple use case of device pointers. The kernel only operates on part of the data, defined by an offset.

using (var dArg1 = gpu.AllocateDevice(arg1))
using (var dArg2 = gpu.AllocateDevice(arg2))
using (var dOutput = gpu.AllocateDevice<int>(Length/2))
{           
    // pointer arithmetic to access a subset of the data
    gpu.Launch(Kernel, lp, dOutput.Length, dOutput.Ptr, dArg1.Ptr + Length / 2, dArg2.Ptr + Length / 2);

    var result = dOutput.ToArray();

    var expected = arg1.Skip(Length/2).Zip(arg2.Skip(Length/2), (x, y) => x + y);

    Assert.That(result, Is.EqualTo(expected));
}

The full code is in the sample gallery.

Using Special Memory Types


Shared Memory

An introduction to shared memory is in the GPU Programming section, where we illustrate how uncoalesced global memory access patterns can be alleviated with shared memory techniques.

Further details on shared memory and references are in the appendix.

Advanced shared memory usage often requires multiple dynamically sized arrays in a single kernel. In this case you need to declare a single block of extern shared memory and use pointers and offsets to divide it into multiple arrays of possibly different types:

var s = Intrinsic.__array_to_ptr(__shared__.ExternArray<byte>());
var sharedFloat = s.Reinterpret<float>();
var sharedInt = (s + numFloats * sizeof(float)).Reinterpret<int>();  // offset in bytes past the float block

Shared memory is organized in banks. The bank size can be configured to either four or eight bytes at the level of a kernel by adding the attribute SharedMemConfig to a kernel as follows:

[SharedMemConfig(SharedMemConfig.FourBytesBankSize)]
private static void KernelUsing4BytesSharedMemBankSize(...)
{
    ...
}

[SharedMemConfig(SharedMemConfig.EightBytesBankSize)]
private static void KernelUsing8BytesSharedMemBankSize(...)
{
    ...
}

The matrix transpose sample shows how to exploit the configuration of the shared memory bank size to improve memory throughput.

Volatile Memory Qualifier

Some parallel algorithms require that the value of some variables in shared or global memory can be changed or used at any time by another thread. This implies that any reference to this variable has to compile to an actual memory read or write instruction. Declaring a variable in global or shared memory as volatile prevents compiler optimizations which might cache it in registers or local memory whose scope is specific to a single thread:

volatile_deviceptr<T> Intrinsic.__ptr_volatile<T>(deviceptr<T> ptr)

The warp reduction with shared memory example (samples/csharp/warpreduceshared.html) relies on this technique and declares the data residing in shared memory with the volatile qualifier.
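
As a rough sketch (the array size and the reduction step are illustrative, not taken from the sample), a volatile pointer to a shared array can be obtained and used inside a kernel as follows:

var shared = __shared__.Array<int>(64);
var ptr = Intrinsic.__ptr_volatile(Intrinsic.__array_to_ptr(shared));

// every read and write through 'ptr' compiles to an actual memory access
ptr[threadIdx.x] += ptr[threadIdx.x + 32];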

Local Memory

The characteristics of local GPU memory are explained in the appendix. Local memory is local to a thread. It must be allocated as a 1D or 2D array in a kernel with a compile-time constant size.

T[] __local__.Array<T>(int length);
T[,] __local__.Array<T>(int rows, int cols);

It can then be used in a GPU kernel as an ordinary 1D or 2D array.

private static void Kernel(int[] result, int[] arg1, int[] arg2)
{
    var local = __local__.Array<int>(Length);

    var i = threadIdx.x;

    local[i] = arg2[i];
    result[i] = arg1[i] + local[i];
}

The complete code is in the sample gallery.

Constant Memory

The characteristics of constant GPU memory are explained in the appendix. Constant memory is used by defining symbols. We can create symbols for single values or for arrays of values. They are declared as follows:

[GlobalSymbol]
private static readonly GlobalVariableSymbol<int> ConstValue = Gpu.DefineConstantVariableSymbol<int>();

private const int Length = 256;

[GlobalSymbol]
private static readonly GlobalArraySymbol<int> ConstArg = Gpu.DefineConstantArraySymbol<int>(Length);

Before we can use the constant memory symbols we have to assign values to them. This is done with a host to device copy:

var gpu = Gpu.Default;
gpu.Copy(value, ConstValue);
gpu.Copy(array, ConstArg);

In case of multiple GPUs the symbols are generated on each of them. However, the values have to be copied to each GPU separately. The constant memory symbols can then be used in a kernel:

public static void Kernel(int[] result)
{
    var tid = threadIdx.x;
    result[tid] = ConstArg[tid] + ConstValue.Value;
}

We can always assign new values to constant memory symbols by copying them to the GPUs. However, the values cannot be changed from within a GPU kernel. The sample code is in the sample gallery.
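
As noted above, with multiple GPUs the values have to be copied to each device separately. A minimal sketch of this, reusing the symbols declared earlier, could look as follows:

foreach (var device in Device.Devices)
{
    var gpu = Gpu.Get(device);
    gpu.Copy(value, ConstValue);
    gpu.Copy(array, ConstArg);
}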

Unified Memory Support

CUDA unified memory is accessible to both the CPU and the GPU through a single pointer, and the CUDA runtime automatically migrates data allocated in unified memory between host and device. Unified memory can simplify memory management in GPU-accelerated applications.

The device attribute Device.Attributes.ManagedMemory identifies those GPUs which have unified memory support. Unified memory must be allocated on a GPU and there are the following constraints:

  • The process must be 64 bit.
  • The element type must be blittable; Boolean is allowed, but a struct with Boolean fields is not.

Here is an example which uses unified memory. It initializes the data in unified memory from the CPU. A faster way would be to use Gpu.Copy to copy the data directly to the GPU. We then launch a kernel that increments the values. Before we access the data again from the CPU we have to call gpu.Synchronize(); otherwise the result would be wrong or we might even crash the graphics driver. Finally we access the data in unified memory from the CPU and check the result.

var gpus = Device.Devices.Where(device => device.Attributes.ManagedMemory == 1).Select(Gpu.Get).ToArray();

if (gpus.Length > 0)
{
    using (var unifiedMemory = gpus[0].AllocateUnified<int>(Length))
    {
        var gpu = gpus[0];

        for (var i = 0; i < Length; i++)
            unifiedMemory[i] = i;

        var ptr = unifiedMemory.Ptr;
        gpu.For(0, Length, i => ptr[i] += 100);

        gpu.Synchronize();

        var expected = Enumerable.Range(0, Length).Select(i => i + 100).ToArray();

        for (var i = 0; i < Length; i++)
            Assert.That(unifiedMemory[i], Is.EqualTo(expected[i]));
    }
} 

The full code is in the sample gallery. Note that unified memory does not offer the same performance as explicit host-device memory transfers. Moreover, unified memory cannot serve as a replacement for the Alea GPU automatic memory management because it cannot be implemented on top of .NET array types. Additional details on unified memory are in the appendix.

Pinned Memory

The device property Device.Attributes.CanMapHostMemory can be queried to check if a device supports pinned memory. Pinned memory is useful for optimizing data transfer between the host and the device because it is page-locked and can be copied asynchronously and at higher bandwidth than pageable memory. The pinned memory sample measures timings for various copy operations:

Memory Copy Scenario                                      Time
Copy pageable host memory to pinned host memory           1.19 ms
Copy pinned host memory to device memory                  1.15 ms
Copy pageable host memory to pinned to device memory      2.21 ms
Copy pageable host memory directly to device memory       1.22 ms
Allocate device memory with value using Allocate          5.47 ms
Allocate device memory with values using AllocateDevice   5.12 ms

Another use case of pinned memory is overlapping memory transfer with kernel execution using streams, as explained in the next section. Further details on pinned memory are in the appendix.
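
As a small sketch (the element type and array length are illustrative), checking for pinned memory support and allocating pinned host memory could look like this; AllocatePinned and Dispose are used in the same way in the streams example below:

// select a GPU whose device supports mapped, page-locked host memory
var device = Device.Devices.First(d => d.Attributes.CanMapHostMemory == 1);
var gpu = Gpu.Get(device);

// page-locked host memory, initialized from an ordinary .NET array;
// dispose it explicitly once it is no longer needed
var pinned = gpu.AllocatePinned<double>(new double[Length]);
// ... use it as a source or target of stream copies ...
pinned.Dispose();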

CUDA Streams


A CUDA stream represents a sequence of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. Details are described in the appendix. Here we show how to use CUDA streams to increase parallelism on a GPU device.

A GPU device has a single default stream that is used for all host threads. It causes implicit synchronization: two commands in different streams cannot run concurrently if a host thread issues a command to the default stream between them. The default stream is useful if concurrency is not crucial to performance.

We specify the stream for a kernel launch or a host-device memory copy with the methods Stream.Launch and Stream.Copy respectively. All kernel launches and memory copies that do not specify a stream explicitly run in the default stream. Overlapping kernel execution and memory copy depends on the GPU hardware capabilities. Overlapping kernels and memory copies is possible if Device.Attributes.GpuOverlap is nonzero. At least two copy engines are required to overlap a host to device copy with a device to host copy. The number of copy engines can be retrieved from Device.Attributes.AsyncEngineCount.
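
The capability check itself might look like the following sketch (the variable names are illustrative):

var device = Device.Devices.First();

var canOverlapCopyAndKernel = device.Attributes.GpuOverlap != 0;
var copyEngines = device.Attributes.AsyncEngineCount;

// overlapping a host-to-device copy with a device-to-host copy needs at least two copy engines
var canOverlapCopies = copyEngines >= 2;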

The first example illustrates overlapping kernel execution. It launches eight copies of a simple kernel on eight streams. We launch only a single thread block, so there are plenty of resources to run multiple of them concurrently. Let us first run the kernel eight times in the default stream:

var results = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();

for (var i = 0; i < 8; i++)
{
    Gpu.Default.Launch(Transform.Kernel, new LaunchParam(1, 64), results[i]);
}

foreach (var result in results) Gpu.Free(result);

As expected the Visual Profiler shows that the default stream is executing all these kernels sequentially.

In Order Kernel Execution

We now launch the kernel in eight different streams:

var streams = Enumerable.Range(0, 8).Select(_ => gpu.CreateStream()).ToArray();
var results = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();

for (var i = 0; i < 8; i++)
{
    streams[i].Launch(Transform.Kernel, new LaunchParam(1, 64), results[i]);
}

foreach (var stream in streams) stream.Dispose();
foreach (var result in results) Gpu.Free(result);

The Visual Profiler reports that all the kernels execute concurrently on the GPU.

Overlapping Kernel Execution

To overlap host-device memory copies we have to use pinned memory and explicit memory management. Here is an example. Pinned memory, which is page-locked host memory, is allocated with Gpu.AllocatePinned<T>(T[]). We explicitly copy the data in pinned memory from the host to the device, run the kernel and copy back the results, all in a stream. Note that we explicitly free the pinned memory with PinnedMemory<T>.Dispose().

var pinned = Enumerable.Range(0, 8).Select(_ => gpu.AllocatePinned<double>(RandomArray(Length))).ToArray();

var streams = Enumerable.Range(0, 8).Select(_ => gpu.CreateStream()).ToArray();
var args = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();
var results = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();

for (var i = 0; i < 8; i++)
{
    streams[i].Copy(pinned[i], args[i]);                                          // host (pinned) to device
    streams[i].Launch(Transform.Kernel2, lp, results[i], args[i], Transform.Op);  // kernel on the same stream (same pattern as the next example)
    streams[i].Copy(results[i], pinned[i]);                                       // device to host (pinned)
}

foreach (var stream in streams) stream.Synchronize();
foreach (var result in results) Gpu.Free(result);
foreach (var arg in args) Gpu.Free(arg);
foreach (var arg in pinned) arg.Dispose();
foreach (var stream in streams) stream.Dispose();

If we look at the timeline in the Visual Profiler we see that now, besides overlapped kernel execution, device to host copies also overlap with host to device copies.

Overlapping Memory Transfer

It is also possible to overlap memory copy and kernel execution:

var args = Enumerable.Range(0, 8).Select(_ => RandomArray(Length)).ToArray();

var streams = Enumerable.Range(0, 8).Select(_ => gpu.CreateStream()).ToArray();
var argsDevice = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();
var results = Enumerable.Range(0, 8).Select(_ => gpu.Allocate<double>(Length)).ToArray();

// warm up and JIT
streams[0].Launch(Transform.Kernel1, lp, results[0]);

for (var i = 0; i < NumIterations; i++)
{
    for (var j = 0; j < 8; j++)
    {
        streams[j].Copy(args[j], argsDevice[j]);
        streams[j].Launch(Transform.Kernel2, lp, results[j], argsDevice[j], Transform.Op);
    }

    gpu.Synchronize();
}

foreach (var stream in streams) stream.Synchronize();
foreach (var result in results) Gpu.Free(result);
foreach (var arg in args) Gpu.Free(arg);
foreach (var arg in argsDevice) Gpu.Free(arg);
foreach (var stream in streams) stream.Dispose();

Overlapping Kernel Execution and Memory Transfer

The sample can be found in the sample gallery.

Only newer GPUs with multiple copy engines can overlap memory transfers, and it is only possible to overlap host-to-device with device-to-host copies and vice versa. The device attributes Device.Attributes.GpuOverlap and Device.Attributes.AsyncEngineCount indicate whether the GPU device can overlap copies with kernel execution and how many copy engines it has.

Recommendation:

We advise using explicit memory management whenever host-device memory copy overlapping is desired.

Performance Measurements and Profiling


The NVIDIA Visual Profiler is a graphical profiling tool that displays a timeline of the application's CPU and GPU activity together with various performance metrics. It is very useful for identifying optimization opportunities. Visual Profiler is part of the NVIDIA CUDA Toolkit.

The Visual Profiler is available as a standalone application and, as part of Nsight Visual Studio Edition, also integrated in Visual Studio. Here we use the standalone application because it gives more flexibility and does not have the 32 bit limitation of the Nsight Visual Profiler in Visual Studio.

Let's walk through a profiling exercise. For this we use the project MatrixTransposeShared in the sample gallery. Build the project, preferably in release mode, and launch Visual Profiler. Choose File > New Session and enter the path to the executable. We turn off Enable unified memory profiling.

Profiling Execution Properties

Visual Profiler performs a first round of profiling. Next choose Examine Individual Kernels in the CUDA Application Analysis pane:

Examine Individual Kernels

The result of this profiling is a listing which displays the kernels ordered according to their optimization importance, based on achieved occupancy and execution time. For the matrix transpose example we get the following ordering:

Kernel Optimization Importance

As expected the SimpleKernel has the highest optimization potential. We select this kernel and choose Perform Kernel Analysis:

Perform Kernel Analysis

Visual Profiler identifies that the main bottleneck of the kernel is memory bandwidth:

Perform Kernel Analysis

Choose Perform Memory Bandwidth Analysis to further investigate the access pattern:

Perform Kernel Analysis

More insight can be gained by collecting additional events. Select the Details pane, then the kernel, and then click the icon to configure the metrics and events:

Configure Metrics and Events

Configure the metrics and events as required. Here we are interested in the shared memory load bank conflicts:

Configure Shared Mem Events

Select Apply and Run. The result is as expected: the kernel NoBankConflictsKernel has 0 shared load bank conflicts:

Detailed Analysis with Events

Note that we explicitly set the shared memory bank size for the kernel NoBankConflictsBank8Kernel to 8 bytes for double precision. This can be verified in the properties of the kernel:

Bank Size

Recommendation:

We suggest using the standalone Visual Profiler application rather than the profiler that comes with the Nsight Visual Studio Edition, because of the latter's limitation to 32 bit.

Using NVIDIA CUDA Libraries


P/Invoke Libraries

Several packages, such as Alea.CudaToolkit and Alea.CudaDnn, provide P/Invoke access to CUDA libraries.

To use the P/Invoke libraries, you need to configure the system environment so that the native libraries can be found:

  • Alea.CudaToolkit - On Windows, you need to set the CUDA_PATH environment variable to specify the location of your CUDA toolkit installation. On Linux or Mac OS, the libraries must be visible to the ldconfig system.
  • Alea.CudaDnn - On Windows, the cuDNN library should be copied into the CUDA toolkit installation folder pointed to by the CUDA_PATH environment variable.

Samples are available in the sample gallery.

Integrated Libraries

Several important CUDA libraries are deeply integrated with Alea GPU so that they can be used along with the types and abstractions provided by Alea GPU. In particular they can be used together with .NET arrays as well as with automatic memory management.

  • The NVIDIA cuBLAS library is a GPU-accelerated version of the complete standard BLAS library. Details on the API and its usage can be found in the NVIDIA cuBLAS documentation. A cuBLAS example is in the sample gallery.

  • The NVIDIA cuRAND library provides high performance GPU-accelerated random number generation. Alea GPU cannot expose the cuRAND device level interface because these routines have to be directly compiled into the kernel function. A later version of Alea GPU will provide similar functionality. For additional details on the API and its usage we refer to the NVIDIA cuRAND documentation. Examples of how to use cuRAND are in the sample gallery.

  • The NVIDIA cuDNN library is a GPU-accelerated library of primitives for deep neural networks.

Printing in Kernels


Printing from a GPU kernel is a basic but powerful diagnostic technique. Alea GPU supports Console.WriteLine and Console.Write inside kernel code. This allows you, for example, to print the block and thread index:

Gpu.Default.Launch(() => Console.WriteLine("block index {0}, thread index {1}", blockIdx.x, threadIdx.x), new LaunchParam(2, 4));

Note that all prints from kernels are written to the native stdout. This output cannot be captured by tools such as ReSharper, which only show the managed stdout.

The following overloads are supported:

  • WriteLine()
  • WriteLine(bool) Write(bool)
  • WriteLine(int32/uint32/int64/uint64) Write(int32/uint32/int64/uint64)
  • WriteLine(single/double) Write(single/double)
  • WriteLine(string) Write(string)
  • WriteLine(format, obj1, obj2, ...) Write(format, obj1, obj2, ...)

The supported argument types are:

  • bool
  • integers (int8/int16/int32/int64/uint8/uint16/uint32/uint64)
  • floating point number (single/double)
  • string
  • deviceptr

Alea GPU supports a subset of .NET composite formatting.

Currently, interpolated strings such as $"text {variable}" are not supported.

GPU Metaprogramming


The purpose of metaprogramming is to generate different GPU code from common template code. In C++, metaprogramming is usually done with C++ templates. Template code is substituted completely at compile time, hence it is static. No additional runtime overhead is created because C++ templates do not use inheritance, virtual functions or late binding. This is particularly useful for GPU code, where optimal performance matters.

The type parameter concept of .NET is generics. Generics are very different from C++ templates. First of all, generics are runtime based: a special type is used to represent generic type arguments, and the effective type is substituted at runtime, which, compared to C++ templates, introduces some runtime overhead.

Alea GPU provides several techniques for GPU metaprogramming that JIT compile code to different concrete versions at runtime. There are several use cases.

Simple Type Specializations

In CUDA C++ it is quite common to write template kernel code

template<typename T>
void kernel(T* data, int length);

and use it for different types

kernel<double>(doubleData, length);
kernel<int>(intData, length);

These two template instances are compiled twice and the final object module contains two GPU kernels, one for double and one for int. The benefit of C++ templates is that the generated code is fast and that there is no runtime overhead. However, there is one problem: C++ template code generation happens completely at compile time. Each time we want to support a new type we have to recompile. From a deployment perspective, we cannot distribute C++ template code as a binary. It is necessary to distribute it in source code, which reveals all the implementation details.

Deployment in .NET is much more streamlined, based on the concept of assemblies. Alea GPU uses .NET generics to simulate C++ template code generation. GPU kernels can be generic functions

static void kernel<T>(T[] data);

The compilation happens at runtime when you first call it with a specific type argument:

kernel<double>(doubleArray);
kernel<float>(floatArray);

The function kernel<T> is JIT compiled for double and float, resulting in GPU code for both types. JIT compilation has a small runtime overhead when you first call the kernel, but everything is contained in the assembly. Use cases of this runtime JIT compilation mechanism are plugin frameworks or libraries for algorithms that have to work on multiple types. For deployment it is enough to distribute the assembly. The end user JIT compiles all required concrete kernels at runtime. It is even cross-platform, which means the same assembly can execute on Windows, Linux or Mac OS X. An example is in the sample gallery.

Metaprogramming with Functions and Operators

C++ template types can be very general, based on implicit assumptions which are then checked at compile time. Here the type T is assumed to be callable like a function:

template<typename T>
void kernel(float* data, T op) {  data[0] = op(data[0]); }

C++ template types are simply substituted at compile time. The compiler generates a compile error if the type T does not have an overloaded operator() with the right signature. These errors are often very complex and hard to fix.

To implement generic GPU code that is parameterized by functions, Alea GPU relies on .NET delegates. As delegates are type-safe function pointers we have to specify the signature of the delegate:

static void kernel(float[] data, Func<float, float> op) { data[0] = op(data[0]); }

If multiple delegates are required it is a good practice to create a closure class to hold them:

class Foo
{
    [GpuParam] Func<float, float> op1;
    [GpuParam] Func<float, float> op2;

    void kernel(float[] data) { data[0] = op1(data[0]); data[1] = op2(data[1]); }
}

A delegate may enclose additional values. Alea GPU needs to know that these enclosed values have to be passed as parameters to the GPU before the functions are called. This is achieved with the attribute [GpuParam]. The sample gallery contains a complete example.
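
As a small usage sketch (the operator and launch parameters are illustrative, not from the sample), the delegate-parameterized kernel above can be invoked with a lambda passed as a kernel argument:

// a kernel instance is JIT compiled for this operator
Func<float, float> square = x => x * x;
gpu.Launch(kernel, lp, data, square);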

Metaprogramming with Values

Constant size static arrays in kernels are useful for shared and local memory. In C++ we handle this with compile time constants in templates:

template<int Size>
void kernel(...)
{
    __shared__ int shared[Size];
    ...
}

For this use case Alea GPU provides the type Constant<T>. The type T must support comparison. Constant values can be passed to a kernel as an argument:

static void kernel(...., Constant<int> size)
{
    var shared = __shared__.Array<int>(size.Value);
    ...
}

For each value of Constant<int> the JIT compiler creates a new kernel instance:

kernel(..., new Constant<int>(2));
kernel(..., Gpu.Constant(2)); 
kernel(..., Gpu.Constant(3));

This leads to only two JIT compilations, because the two constant objects with value 2 are equal. An illustration of this use case is in the sample gallery.

Metaprogramming with Polymorphism

Often it is necessary to dispatch between different implementations.

In C++ this is typically sketched with template specialization:

typedef enum { SMEM = 0, SHFL = 1 } ReduceImpl;

template<ReduceImpl Impl, typename T, typename Op>
__device__ void Reduce(T inputPerLane, Op op);

template<typename T, typename Op>
__device__ void Reduce<SMEM, T, Op>(T inputPerLane, Op op)
{ /* implementation with shared memory */ }

template<typename T, typename Op>
__device__ void Reduce<SHFL, T, Op>(T inputPerLane, Op op)
{ /* implementation with shfl */ }

We can achieve the same with abstract virtual functions and inheritance:

public enum WarpReduceAlgorithm
{
    Smem,
    Shfl
}

public abstract class WarpReduce
{
    public Constant<WarpReduceAlgorithm> Algorithm { get; }

    // assumed declaration; WarpThreads is referenced by Equals and GetHashCode below
    public Constant<int> WarpThreads { get; }

    // WarpReduce itself could be wrapped as Constant<WarpReduce>, so we have to implement comparison for the JIT cache.
    public override bool Equals(object obj)
    {
        var that = obj as WarpReduce;
        if (that != null)
        {
            return Algorithm.Equals(that.Algorithm) && WarpThreads.Equals(that.WarpThreads);
        }
        return false;
    }

    public override int GetHashCode()
    {
        return Algorithm.GetHashCode() | WarpThreads.GetHashCode();
    }

    // these will be implemented in derived classes
    public abstract T Reduce<T>(deviceptr<byte> warpSharedMem, T input, Func<T, T, T> reductionOp);

    public abstract T Reduce<T>(deviceptr<byte> warpSharedMem, T input, Func<T, T, T> reductionOp, int numItems);
}

public class WarpReduceSmem : WarpReduce
{ /* implement reduce with smem */ }

public class WarpReduceShfl : WarpReduce
{ /* implement reduce with shfl */ }

Symbols

Constant memory is identified with a symbol. Symbols are global identifiers on a specific GPU. Data is copied to a symbol with an instance method such as gpu.Copy(memory, symbol). Be careful when creating symbol instances because it could lead to unnecessary additional JIT compilations. Here is an example.

static void kernelUsingSymbol(...., GlobalArraySymbol<int> cdata)
{  ....;   cdata[i] .... }

{
    var cdata = gpu.DefineConstantArray<int>(10);
    gpu.Copy(mydata, cdata);
    gpu.Launch(kernelUsingSymbol, lp, ...., cdata);
}

{
    // somewhere else 
    var cdata = gpu.DefineConstantArray<int>(10); // different from the first symbol
    gpu.Copy(mydata2, cdata);
    gpu.Launch(kernelUsingSymbol, lp, ...., cdata);
}

The second symbol is a different instance, although it has the same length and probably serves the same purpose. Alea GPU requires an additional JIT compilation of kernelUsingSymbol because we created a new global definition that the kernel must be aware of. The full code is in the sample gallery.
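
A sketch of how to avoid the extra compilation is to define the symbol once, for example in a static readonly field as in the constant memory section above, and then reuse the same instance everywhere (the names are illustrative):

// define the symbol once and reuse the same instance for every copy and launch
static readonly Gpu gpu = Gpu.Default;
static readonly GlobalArraySymbol<int> CData = gpu.DefineConstantArray<int>(10);

static void Run()
{
    gpu.Copy(mydata, CData);
    gpu.Launch(kernelUsingSymbol, lp, ...., CData);

    // later, somewhere else: copy new data to the same symbol instead of defining a new one
    gpu.Copy(mydata2, CData);
    gpu.Launch(kernelUsingSymbol, lp, ...., CData);
}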

Defining and Accessing Values for Metaprogramming

The Alea GPU JIT compiler can use the following types and objects to generate specialized GPU code:

  1. Generic type arguments
  2. Delegates
  3. Constant values wrapped with Constant<T>
  4. Symbols such as constant GPU memory or texture symbols

Delegates, constant values and symbols can be made accessible to GPU code in a variety of ways (a short sketch combining some of these options follows the list):

  1. Pass them as kernel arguments to a kernel function.
  2. Reference them directly in the kernel so that the C# compiler generates a closure.
  3. Put them in a user-defined class as a field. In this case, a delegate needs to be marked with [GpuParam] since it could enclose additional runtime values.
  4. Put them in a static readonly field.
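
The following sketch (all names are illustrative and not taken from the samples) combines options 3 and 4: a delegate held in a [GpuParam] field and a constant value held in a static readonly field, both visible to the kernel without being passed as kernel arguments:

class Mapper
{
    // option 4: a constant value in a static readonly field, picked up by the JIT compiler
    static readonly Constant<int> TileSize = Gpu.Constant(32);

    // option 3: a delegate field; [GpuParam] ensures its enclosed values reach the GPU
    [GpuParam] readonly Func<float, float> op;

    public Mapper(Func<float, float> op) { this.op = op; }

    public void Kernel(float[] data)
    {
        var shared = __shared__.Array<float>(TileSize.Value);
        var i = threadIdx.x;
        shared[i] = op(data[i]);
        data[i] = shared[i];
    }
}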