Overview

This section gives an overview of how to program GPUs with C#.

Note:

We use the term device synonymously for GPU and host for CPU to emphasize the fact that a GPU is often a physically separate device that operates as a coprocessor to the host.

Parallel-For


Parallel-for is useful for executing the same independent operation for each element of a collection or for each index of an ordered range. The operations are independent if they do not write to memory locations that are accessed by other operations. In contrast to a serial loop, the order of execution of the operations is not defined and operations can run in parallel. The signature of the parallel-for is

void Gpu.For(int start, int end, Action<int> op);

The first two arguments specify a range with an inclusive lower bound and an exclusive upper bound. The third argument is the action that is invoked for each index in the range, receiving that index as its argument. Gpu.For requires a GPU instance to run. Alea GPU provides a default GPU for convenience.

var gpu = Gpu.Default;
var n = ...
gpu.For(0, n, i => 
{
    ...
});

The arguments and the result are captured in a closure and passed to the parallel-for body.

Parallel-For Closure

Note:

The operations in the loop body must be independent of each other. They are not allowed to communicate by writing to shared variables or shared array elements.

Here is a complete example that calculates the element-wise sum of two arrays:

var gpu = Gpu.Default;
var arg1 = Enumerable.Range(0, Length).ToArray();
var arg2 = Enumerable.Range(0, Length).ToArray();
var result = new int[Length];

gpu.For(0, result.Length, i => result[i] = arg1[i] + arg2[i]);

The code for this sample can be found in the sample gallery.

The action accesses data elements that are defined outside of the loop body and writes the result directly to a .NET array. Alea GPU takes care of all the memory management. The next section explains this in more detail.

Parallel Aggregate


Parallel aggregate reduces a collection of elements with an associative binary operator to a single value. The implementation does not require that the operator is commutative. The signature of the parallel aggregate method is:

T Gpu.Aggregate<T>(T[] elements, Func<T, T, T> op);

The following example calculates the sum of an array:

var gpu = Gpu.Default;
var arg = Enumerable.Range(0, Length).ToArray();

var result = gpu.Aggregate(arg, (x, y) => x + y);

For convenience Alea GPU already provides the overloads T Gpu.Sum<T>(T[]) and T Gpu.Average<T>(T[]) for the sum and the average of an array. A sample can be found in the sample gallery.
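
For example, the sum from the snippet above can be computed directly with the overload. This is a minimal sketch; the call is written against the static signatures as stated above:

var arg = Enumerable.Range(0, Length).ToArray();

// Equivalent to gpu.Aggregate(arg, (x, y) => x + y)
var sum = Gpu.Sum(arg);
var average = Gpu.Average(arg);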

Automatic Memory Management


GPU devices have their own on-board memory, the global device memory. It is used for transfers between the CPU and the GPU as well as for the data input to and output from kernels. The name global means it can be accessed and modified from both the CPU and the GPU. Global memory allocations can persist for the lifetime of the application.

CPU-GPU Memory

In a typical PC or compute cluster node, the memories of the CPU and GPU are physically distinct and connected by the PCI express bus. Data that is shared between the CPU and GPU must thus be allocated in both memories, and copied between them. Usually, this has to be done by the programmer, increasing the complexity of GPU programs.

The Alea GPU automatic memory management system handles memory allocation and data movement between the different memories without the programmer having to manage this manually. It is efficient – unnecessary copy operations are avoided by analyzing the memory access. The analysis requires code instrumentation, a technique that inserts additional instructions into an existing execution path. Alea GPU modifies the CPU code by inserting instructions that monitor array accesses and perform minimum data transfers between the CPU and GPU.

As these runtime checks generate a slight performance overhead, the scope of analysis is limited to the code carrying the attribute [GpuManaged]. Leaving out this attribute does not mean that data is not copied; it only means that unnecessary intermediate copies may occur.

To illustrate the automatic memory management in more detail, we look at an example that iterates a parallel-for loop 100 times, incrementing the input by one in each iteration. First, we consider the situation without the [GpuManaged] attribute. In this case, the data is still copied automatically, although more frequently than necessary due to the limited scope of analysis.

public static void Unmanaged()
{
    var data = new int[Length];

    for (var k = 0; k < 100; k++)
        Gpu.Default.For(0, data.Length, i => data[i] += 1);

    var expected = Enumerable.Repeat(100, Length).ToArray();
    Assert.That(data, Is.EqualTo(expected));
}

We check the memory copy operations by using the Visual Profiler. As expected, the low-level CUDA driver functions cuLaunchKernel, cuMemcpyHtoD_v2 and cuMemcpyDtoH_v2, which launch the kernel and perform the memory copies, are called 100 times each. This means that the data is copied in and out for each of the 100 sequential parallel-for launches.

Name                 Count
cuMemAlloc_v2        1
cuMemcpyHtoD_v2      100
cuLaunchKernel       100
cuMemcpyDtoH_v2      100
cuMemFree_v2         1

Now we turn on the automatic memory management.

Important:

To use the Alea GPU automatic memory management, it is required to

  1. Reference the Alea.Fody assembly
  2. Add the attribute [GpuManaged] to the code that has to be managed

Apart from the attribute [GpuManaged], the code is identical. Alea GPU now analyzes the memory accesses inside this scope and optimizes the memory copies within it as far as possible.

[GpuManaged]
public static void Managed()
{
    var data = new int[Length];

    for (var k = 0; k < 100; k++)
        Gpu.Default.For(0, data.Length, i => data[i] += 1);

    var expected = Enumerable.Repeat(100, Length).ToArray();
    Assert.That(data, Is.EqualTo(expected));
}

We analyze the effect with the Visual Profiler and see that cuMemcpyHtoD_v2 and cuMemcpyDtoH_v2 are now called just once. The reason is that the result data of a preceding GPU parallel-for loop can stay on the GPU for the succeeding parallel-for loop, without the intermediate data being copied back and forth to the CPU. Copying only occurs for the input of the first GPU execution and for the output of the last GPU computation.

Name                 Count
cuMemAlloc_v2        1
cuMemcpyHtoD_v2      1
cuLaunchKernel       100
cuMemcpyDtoH_v2      1
cuMemFree_v2         1

The code for this sample can be found in the sample gallery.

Note:

Automatic memory management is only possible for arrays with blittable element type.

CUDA Programming Model


Parallel-for and parallel aggregate are well suited to standard parallel tasks. The CUDA programming model provides more flexibility at the expense of increased complexity. It is based on the notion of a kernel, which is a program that is executed multiple times on the GPU in different threads. The threads are hierarchically grouped into thread blocks, each with the same number of threads. Thread blocks are again grouped into a grid.

Each thread has a thread index and a block index that identifies the thread within the block and the block within the grid. For convenience, the thread and block index are three dimensional and can be accessed inside the scope of a kernel through the built-in variables threadIdx.x, threadIdx.y, threadIdx.z and blockIdx.x, blockIdx.y, blockIdx.z. The shape of a block and the extent of the grid are available through blockDim.x, blockDim.y, blockDim.z and gridDim.x, gridDim.y, gridDim.z, respectively.

At launch time, a kernel needs to know its launch parameters, which determine the block and grid size. The hardware imposes some limitations on these sizes. For instance, a block may contain at most 1024 threads and the grid is limited to 65535 blocks in each dimension.

It remains the programmer’s responsibility to define how the threads and blocks are used to perform a parallel computation.

A common practice is to launch one thread per data element. The downside of this approach is that the grid and block layout becomes data-dependent. An alternative approach is to use a fixed grid and block shape. Depending on the amount of data, a GPU thread needs to process more than a single data element. The sample gallery contains many examples, showing how launch parameters can be chosen and how threads and blocks can be applied for a variety of parallel workloads. Additional details about the CUDA programming model can be found in the appendix.
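
As a hedged sketch of the one-thread-per-element approach (the names n and blockSize are illustrative, and LaunchParam is used as in the launch examples later in this section), the grid size can be derived from the data size:

const int blockSize = 256;
// Round up so that every one of the n elements is covered by a thread.
var gridSize = (n + blockSize - 1) / blockSize;
var lp = new LaunchParam(gridSize, blockSize);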

Managed CUDA Kernel

We use the CUDA programming model to write a parallel transform on the GPU. For this purpose, we define a one-dimensional grid of blocks. The total number of threads is therefore blockDim.x * gridDim.x. Here is a schematic illustration of how a grid of thread blocks processes the elements of an array in three iterations:

Transform

The actual GPU kernel is a static method of a C# class. Its arguments are the input arrays and the array for the result.

private static void Kernel(int[] result, int[] arg1, int[] arg2)
{
    var start = blockIdx.x*blockDim.x + threadIdx.x;
    var stride = gridDim.x*blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = arg1[i] + arg2[i];
    }
}

Each thread that runs this kernel first calculates the index of the first element it has to process and stores it in start. As we plan to launch this kernel with a one-dimensional grid of one-dimensional thread blocks, the total number of threads working in parallel will be gridDim.x*blockDim.x. Then, each thread loops through the data with an increment of stride.

Note that you can use .NET arrays of any dimension directly in GPU code, including Array.Length, Array.LongLength, Array.Rank, Array.GetLength(int dim) and Array.GetLongLength(int dim). The next step is to launch the GPU kernel.

[GpuManaged]
public static void Run()
{
    var gpu = Gpu.Default;
    var lp = new LaunchParam(16, 256);
    var arg1 = Enumerable.Range(0, Length).ToArray();
    var arg2 = Enumerable.Range(0, Length).ToArray();
    var result = new int[Length];

    gpu.Launch(Kernel, lp, result, arg1, arg2);

    var expected = arg1.Zip(arg2, (x, y) => x + y);

    Assert.That(result, Is.EqualTo(expected));
}

We write a method Run and attribute it with [GpuManaged] to instruct Alea GPU to automatically manage the memory for all the code of this block. As explained in the previous section, the Alea GPU runtime system will allocate the required storage on the selected GPU, copy data to the GPU memory and transfer results back to the CPU before they are accessed from CPU code. In the example, the transfer of the result back to the CPU will happen in the Assert statement at the end of the Run method.

The static property Gpu.Default returns a suitable default GPU. More details about the selection algorithm are explained in the section on using multiple GPUs. The next step is to determine the grid and block shape. For this example, we choose a fixed grid size of 16 blocks, each consisting of 256 threads. The kernel is launched on the selected GPU with a call to the Gpu.Launch method. It has multiple overloads for up to 16 arguments.

Hint:

You can use ordinary .NET arrays in the GPU kernel and access properties such as Array.Length.

Note:

Attribute the method with [GpuManaged] to turn on optimized automatic GPU memory management within this method.

The code for this sample can be found in the sample gallery.

Generics


Generics are methods, classes, structures or interfaces that have type parameters for one or more of the types that they use. Generics can also be used in GPU kernels. We illustrate it with a copy kernel.

public static void Kernel<T>(T[] output, T[] input)
{
    var start = blockIdx.x*blockDim.x + threadIdx.x;
    var stride = gridDim.x*blockDim.x;
    for (var i = start; i < output.Length; i += stride)
    {
        output[i] = input[i];
    }
}

The method Kernel<T> has a type parameter T, which specifies the element type of the arrays. We can also write a generic launch function

private static void Run<T>(int n)
{
    var gpu = Gpu.Default;
    var lp = new LaunchParam(16, 256);
    var input = Enumerable.Range(0, n).Select(ConvertValue<T, int>).ToArray();
    var result = new T[n];

    gpu.Launch(CopyGeneric.Kernel, lp, result, input);

    Assert.That(result, Is.EqualTo(input));
}

which we then can call for multiple element types with automatic memory management enabled:

[GpuManaged]
public static void Int()
{
    Run<int>(Length);
}

The code is in the sample gallery.

Lambdas and Delegates


Lambdas and delegates can be used in a variety of ways within GPU kernels or as arguments to GPU kernels. Lambdas are also a powerful way to unify CPU and GPU code. If a lambda or delegate does not rely on GPU specifics, such as shared memory or special GPU intrinsics, it can be executed on the GPU as well as on the CPU.

The first example uses delegates with the parallel-for pattern. We can use a delegate to define an action.

Func<T, T, T> op = ...
Action<int> action = i => result[i] = op(arg1[i], arg2[i]);

Gpu.Default.For(0, result.Length, action);

The full example is in the sample gallery. Another example illustrates partial function application.

We can also write a GPU kernel directly as a delegate and launch it. The arguments and the result are included in the function closure.

Action kernel = () =>
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = arg1[i] + arg2[i];
    }
};

gpu.Launch(kernel, lp);

Generics and delegates can be used to write custom kernels. There is an example of a generic transform. The kernel takes a binary function object as argument and applies it to arrays of input data.

public static void Kernel<T>(T[] result, T[] arg1, T[] arg2, Func<T, T, T> op)
{
    var start = blockIdx.x*blockDim.x + threadIdx.x;
    var stride = gridDim.x*blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = op(arg1[i], arg2[i]);
    }
}

We can launch it, for example, with a lambda as follows

gpu.Launch(TransformGeneric.Kernel, lp, result, arg1, arg2, (x, y) => x + y);

The full source code of the generic transform is in the sample gallery.

Another use case is higher-order functions that build kernels, as follows:

Func<Action<int>, Action> kernelBuilder = a => () =>
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        a(i);
    }
};

Gpu.Default.Launch(kernelBuilder(i => result[i] = arg1[i] + arg2[i]), lp);

The example is in the sample gallery.

More complicated applications require generic arithmetic operations, for example the generic matrix multiplication. It uses a generic class, type constraints and generic delegates for the generic arithmetic operations.

Debugging


On Windows, GPU kernels can be debugged with the NVIDIA Nsight Visual Studio Edition debugger. It allows you to set breakpoints directly in GPU kernel code, inspect memory, watch the values of local variables and perform memory checks. To enable debugging, the App.config file requires a configuration section that sets the JIT compilation level to Diagnostic:

<configuration>
    <configSections>
        <section name="aleaSettings" type="Alea.Settings, Alea"/>
    </configSections>
    <aleaSettings>
        <jitCompile level="Diagnostic"/>
    </aleaSettings>
</configuration>

The project has to be built with a debug configuration. The menu item Nsight > Start CUDA Debugging starts the GPU debugging process. The GPU debugger stops at breakpoints that are set in GPU code.

Start Debug

Data and variables can be inspected in a warp watch window. It displays the values of variables and memory locations for all threads of a warp. This gives a good overview of the operations of the threads of a warp. The menu item Nsight > Windows > CUDA Warp Watch opens a warp watch window. It can also be used to display .NET arrays. For instance, array->Pointer[idx] displays the element at index idx and array->Lengths[n] retrieves the length of dimension n.

Debug with Warp Watch

Tip:

Class libraries cannot be debugged; change the project type to Console Application in the Application tab of the project properties.

The GPU debugger only stops at breakpoints that are set in GPU code. It cannot catch breakpoints in ordinary CPU code.

Shared Memory


Not all global memory access patterns are equally efficient. The GPU device tries to coalesce global memory loads and stores into as few transactions as possible to minimize global memory bandwidth. However, this is not always possible. The following conditions may result in uncoalesced memory transactions, in which case the memory access has to be serialized:

  • Non-sequential memory access
  • Sparse memory access
  • Misaligned memory access

Coalescing memory access is highly relevant for efficient global memory access. It is often possible to coalesce memory access by using shared memory as an intermediate buffer. Shared memory is on-chip and much faster than local and global memory. It is allocated per thread block and all threads in the block have access to the same shared memory. Threads can access shared memory written by other threads within the same thread block. This capability combined with thread synchronization is useful for

  • User-managed caches
  • High-performance cooperative parallel algorithms such as parallel scan and reduce
  • Enabling coalescing global memory operations

More details about shared memory can be found in the chapter advanced features and in the appendix.

When sharing data between threads, we must be careful to avoid race conditions. This may make it necessary to synchronize threads in certain cases. For this purpose, a barrier can be established with the synchronization primitive DeviceFunction.SyncThreads(). A thread only proceeds past the DeviceFunction.SyncThreads() after all threads in its block have executed DeviceFunction.SyncThreads() too.

Shared memory inside a kernel can be declared in multiple ways, depending on whether the amount of memory is known at compile time or at run time. The following example reverses an array of 64 elements using shared memory. The first kernel declares an array of constant size inside the kernel.

private const int Length = 64;

private static void Static<T>(T[] data)
{
    var shared = __shared__.Array<T>(Length);
    var i = threadIdx.x;
    var ir = data.Length - i - 1;
    shared[i] = data[i];

    DeviceFunction.SyncThreads();

    data[i] = shared[ir];
}

In this kernel i and ir are the indices of the original and reverse order. Threads copy data first from global memory to distinct locations in shared memory. After the barrier, threads copy data from shared memory back to distinct locations in global memory, however, in reversed order. Two points are important:

Note:

Thread synchronization is crucial because threads access data written by other threads. The array length used to declare the shared memory inside the kernel must be a compile-time constant.

The next example uses dynamic shared memory that is allocated outside the kernel.

private static void Dynamic<T>(T[] data)
{
    var shared = __shared__.ExternArray<T>();
    var i = threadIdx.x;
    var ir = data.Length - i - 1;
    shared[i] = data[i];

    DeviceFunction.SyncThreads();

    data[i] = shared[ir];
}

The amount of shared memory allocated per thread block is specified in bytes along with the kernel launch parameters (third parameter):

var lp = new LaunchParam(1, Length, Length * sizeof(T));
gpu.Launch(Dynamic, lp, data);

The two kernels are very similar; they only differ in how the shared memory arrays are declared and in the kernel launch parameters. The purpose of using shared memory is to achieve coalesced global memory access. Optimal memory coalescing is reached if global memory is accessed through a linear aligned index such as i = threadIdx.x. The reversed index ir is only used to access shared memory, which does not have this sequential access restriction. The code is in the sample gallery.

More advanced uses of shared memory can be found in the appendix and in the sample gallery.

GPU Selection and Multiple GPUs


A system can have multiple GPU devices, each identified by its device id.

Note:

The list of device ids for the GPU devices in a system is not necessarily contiguous.

The property Device.Devices returns all GPU devices in a system. The device instance can be used to get the properties and attributes of a device:

var devices = Device.Devices;
var numGpus = devices.Length;
foreach (var device in devices)
{
    // print device information to standard output
    device.Print(); 

    var id = device.Id;
    
    var arch = device.Arch;
    var numMultiProc = device.Attributes.MultiProcessorCount;
}

The GPU for a specific device id is retrieved by gpu = Gpu.Get(id). To find all available device ids in a system, use the property Device.Devices and query the device id of each element.

var deviceIds = Device.Devices.Select(device => device.Id);

Alea GPU includes an internal algorithm to determine a suitable default GPU, which is returned by the property Gpu.Default. We first search for GeForce GPUs; if there are any, we select the one with the largest number of cores. If there is no GeForce GPU, we choose as default the GPU with the largest number of cores, which could then be a Tesla or Quadro GPU. There is a device query example in the sample gallery.

We can use multiple GPUs very easily if the computations can be decomposed into multiple independent parts and the partial results can be aggregated. This is the well-known map-reduce pattern. Here is a simple example that uses the parallel-for pattern in a multi-GPU setting. First, we create a list of all GPUs in the system:

var gpus = Device.Devices.Select(device => Gpu.Get(device.Id)).ToList();
var numGpus = gpus.Count();

Second, we decompose the input data into independent partitions and launch the computations. Alea GPU automatically allocates memory on, and copies data to, the corresponding GPUs.

var results = gpus.Select((gpu, k) =>
{
    // partition data for single GPU
    var arg1Part = arg1.Skip(k*Length).Take(Length).ToArray();
    var arg2Part = arg2.Skip(k*Length).Take(Length).ToArray();
    var resultPart = new int[Length];

    gpu.For(0, resultPart.Length, i => resultPart[i] = arg1Part[i] + arg2Part[i]);

    return resultPart;
});

The third and final step is to aggregate the partial results.

var result = results.SelectMany(r => r).ToArray();

Note that the result is distributed among multiple GPUs. Each GPU holds a part of the result. To get the final result, the arrays have to be flattened, or aggregated on a GPU or the CPU to a final value. The above example code performs the final aggregation on the CPU. This is a typical map-reduce pattern. The complete code is in the sample gallery.

Custom Types


It is possible to create custom types with structures and use them in GPU kernels. Structures (structs) are value types, whereas .NET classes constitute reference types living on the heap. Because structs are value types, we can efficiently copy them between the CPU and GPU memory. Structs can also be generic types.

Here is an example of a struct that represents a complex number. The type parameter T is usually a floating point type such as double or float.

public struct Complex<T>
{
    public T Real;
    public T Imag;

    public override string ToString()
    {
        return $"({Real}+I{Imag})";
    }
}

Note that the current version of Alea GPU only supports fields in structs but not properties. In particular, it is not possible to define the Real and Imag parts of a complex number with properties:

public struct Complex<T>
{
    public T Real { get; set; }
    public T Imag { get; set; }

    ...
}

The following delegate performs complex addition of elements of type Complex<T>. It creates the result value directly with the default constructor. Note that this delegate is free of any GPU specific code and can be executed on the CPU and GPU alike.

Func<Complex<T>, Complex<T>, Complex<T>> add = (x, y) => new Complex<T>
{
    Real = x.Real + y.Real,
    Imag = x.Imag + y.Imag
};

This delegate can now be used in the parallel Gpu.For to perform element-wise complex addition

Gpu.Default.For(0, result.Length, i => result[i] = add(arg1[i], arg2[i]));

or in a custom generic kernel such as

public static void Kernel<T>(Func<T, T, T> op, T[] result, T[] arg1, T[] arg2)
{
    var start = blockIdx.x*blockDim.x + threadIdx.x;
    var stride = gridDim.x*blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = op(arg1[i], arg2[i]);
    }
}

Gpu.Default.Launch(TransformGeneric.Kernel, lp, add, result, arg1, arg2);

We refer to the parallel-for with delegate and to the generic transform example for additional details.

Device Functions


In the module DeviceFunction, Alea GPU exposes generic versions of many NVIDIA LibDevice functions as well as CUDA-specific functions such as shuffle instructions, atomic instructions and warp instructions.

Most math functions from System.Math are overloaded so that on the GPU the corresponding NVIDIA LibDevice function is called. A list is in the appendix.
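
As a brief sketch, reusing the parallel-for API from earlier in this section, a kernel body can call System.Math directly and the call is mapped to the corresponding LibDevice function on the GPU:

var input = Enumerable.Range(1, Length).Select(i => (double)i).ToArray();
var output = new double[Length];

// Math.Sqrt inside the GPU code is mapped to the corresponding LibDevice function.
Gpu.Default.For(0, output.Length, i => output[i] = Math.Sqrt(input[i]));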

Synchronization Functions

  • DeviceFunction.SyncThreads
  • DeviceFunction.SyncThreadsCount
  • DeviceFunction.SyncThreadsAnd
  • DeviceFunction.SyncThreadsOr

Memory Fence Functions

  • DeviceFunction.ThreadFenceBlock
  • DeviceFunction.ThreadFence
  • DeviceFunction.ThreadFenceSystem

Shuffle Instructions

GPUs based on the Kepler architecture or newer provide several shuffle intrinsics. These allow threads in a warp to collectively exchange or broadcast data without using shared memory. The exchange occurs simultaneously for all active threads within the warp. Alea GPU supports the following shuffle instructions:

  • DeviceFunction.Shuffle
  • DeviceFunction.ShuffleUp
  • DeviceFunction.ShuffleDown
  • DeviceFunction.ShuffleXor

The NVIDIA CUDA Programming Guide contains more information. For convenience we also added the class FullWarpShuffle, which provides generic shuffle overloads such as

  • T Broadcast<T>(T input, int srcLane)
  • T Up<T>(T input, int srcOffset)
  • T Down<T>(T input, int srcOffset)

Using shuffle intrinsics instead of shared memory has several advantages:

  1. Shuffle does not use any shared memory.

  2. Shuffle increases the effective bandwidth because it replaces a multi-instruction shared memory sequence with a single instruction.

  3. Synchronization within a warp is implicit in the instruction; there is no need to synchronize the whole thread block.

Note:

The shuffle intrinsics require compute capability 3 or higher.

The usage of the shuffle down intrinsic is illustrated by the generic warp reduce with shuffle in the sample gallery.
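
As a hedged sketch of the idea (the exact DeviceFunction.ShuffleDown signature and the warp size of 32 are assumptions here), a warp-level sum can be built from shuffle down operations:

// Hedged sketch: sums a value across the 32 lanes of a warp using shuffle down.
// Assumes DeviceFunction.ShuffleDown(value, offset); lane 0 ends up with the total.
private static int WarpSum(int value)
{
    for (var offset = 16; offset > 0; offset /= 2)
    {
        value += DeviceFunction.ShuffleDown(value, offset);
    }
    return value;
}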

Atomic Functions

An atomic function performs a read-modify-write atomic operation. An atomic operation is guaranteed to be performed without interference from other threads. Alea GPU supports all the atomic functions:

  • DeviceFunction.AtomicAdd
  • DeviceFunction.AtomicSub
  • DeviceFunction.AtomicExchange
  • DeviceFunction.AtomicMin
  • DeviceFunction.AtomicMax
  • DeviceFunction.AtomicIncrease
  • DeviceFunction.AtomicDecrease
  • DeviceFunction.AtomicCompareAndSwap

Note:

The atomic operations require compute capability 3 or higher.

Details are in the NVIDIA CUDA Programming Guide.
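
As a hedged sketch of a typical use (assuming DeviceFunction.AtomicAdd accepts a reference to an array element), many threads can update a shared counter without racing:

var data = Enumerable.Range(0, Length).ToArray();
var count = new int[1];

// Each thread that finds an even element atomically increments the shared counter.
Gpu.Default.For(0, data.Length, i =>
{
    if (data[i] % 2 == 0)
        DeviceFunction.AtomicAdd(ref count[0], 1);
});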

Warp Vote Functions

Alea GPU supports all the warp vote functions

  • DeviceFunction.All
  • DeviceFunction.Any
  • DeviceFunction.Ballot

These functions take an integer predicate from each thread in the warp and compare it with zero. The results of the comparisons are combined across the active threads of the warp according to the specific vote function and a single return value is broadcasted to each thread. We again refer to the NVIDIA CUDA Programming Guide.
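
Inside a kernel this could look as follows (a hedged sketch; the integer predicate convention is taken from the description above):

// Hedged sketch inside a kernel body: every active lane contributes an integer
// predicate and every lane receives the combined result.
var predicate = data[i] > 0 ? 1 : 0;
var allPositive = DeviceFunction.All(predicate);
var anyPositive = DeviceFunction.Any(predicate);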

Warp Lane Mask Properties

  • DeviceFunction.LaneMaskLessThan
  • DeviceFunction.LaneMaskLessOrEqual
  • DeviceFunction.LaneMaskGreaterThan
  • DeviceFunction.LaneMaskGreaterOrEqual

NVIDIA LibDevice Functions

The NVIDIA LibDevice functions are exposed in LibDevice. The function names start with the prefix __nv_. They are all non-generic and have an implementation __nv_xxxf for single and __nv_xxx for double precision. We refer to the NVIDIA LibDevice User's Guide.
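
A brief hedged sketch of calling these functions directly (the member names on the LibDevice class are assumed to follow the __nv_ naming described above):

// Hedged sketch: single precision uses the f-suffixed variant, double precision the plain name.
var s = LibDevice.__nv_sinf(1.0f);
var d = LibDevice.__nv_sin(1.0);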