F# Advanced GPU Programming

This section discusses the following advanced F# GPU programming features:

GPU Module

Sometimes it is useful to group GPU functions, global symbols and host routines together in a GPU module. This can be achieved with the gpumodule workflow, which produces a GPU module template. At runtime, this template is applied to a concrete GPU instance, which is subsequently JIT compiled and loaded. The following GPU resources can be defined in a GPU module template:

  • Kernels, defined by Compiler.DefineKernel.
  • Device functions, defined by Compiler.DefineFunction.
  • Global GPU symbols, such as constant memory symbols, defined by Compiler.DefineConstantArray and Compiler.DefineConstantVariable.
  • Entry function Entry<'T>, which should always be the last resource returned by the gpumodule workflow. It is a host function which provides the entry point of the GPU module.

Alea GPU uses F# computation expressions to implement the gpumodule workflow.

Let's revisit the parallel transform introduced in section GPU Programming with F# and recast it as a GPU module. Although there is only one resource, the GPU kernel function, it may still be useful to encapsulate the launching details, such as the GPU memory copy, allocation and freeing, as well as the calculation of the launch parameters.

[<ReflectedDefinition>]
let kernel (result:int[]) (arg1:int[]) (arg2:int[]) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let mutable i = start
    while i < result.Length do
        result.[i] <- arg1.[i] + arg2.[i]
        i <- i + stride

let transformModule = gpumodule {
    let! kernel = <@ kernel @> |> Compiler.DefineKernel

    return Entry(fun program ->
        let gpu = program.Gpu
        let kernel = program.Apply kernel

        let divup (num:int) (den:int) = 
            (num + den - 1) / den

        let transform (arg1:int[]) (arg2:int[]) =
            let length = arg1.Length
            let darg1 = gpu.Allocate(arg1)
            let darg2 = gpu.Allocate(arg2)
            let dresult = gpu.Allocate<int>(length)

            let blockSize = 256
            let gridSize = min 256 (divup length blockSize)
            let lp = LaunchParam(gridSize, blockSize)
            kernel.Launch lp dresult darg1 darg2

            let result = Gpu.CopyToHost(dresult)
            Gpu.Free(darg1)
            Gpu.Free(darg2)
            Gpu.Free(dresult)
            result

        transform) }

In the gpumodule workflow, the kernel symbol is defined. Then the entry point is returned. The Entry<'T> constructor takes a program instance from which the actual GPU instance can be retrieved. The Program.Apply method is applied to the kernel symbol to turn the kernel into a launchable GPU function on that GPU instance. The function transform encapsulates the GPU memory management and the calculation of the launch parameters and is returned as the entry function of the GPU module.

let runGpuModule() =
    let arg1 = Array.init length id
    let arg2 = Array.init length id
    let expected = (arg1, arg2) ||> Array.map2 (+)

    let gpu = Gpu.Default
    let transform = gpu.EntryOf transformModule
    let actual = transform arg1 arg2

    Assert.AreEqual(expected, actual)

To use this GPU module template, the method Gpu.EntryOf is called to retrieve the entry function. Every GPU instance caches the JIT compilation result of this GPU module template, keyed by the template instance, in a similar way as described in the section JIT Compilation. This also means that GPU module template instances should be created as global values to avoid unnecessary repetitive JIT compilation.

The code for this sample is in the sample gallery.

Besides the benefit of bundling GPU resources and algorithms together in a GPU module, GPU module templates also allow advanced control of compilation and linking process.

GPU Pointers and Memory

The pointer type deviceptr<'T> can be used to access data with a pointer. In GPU kernels pointers are more flexible than .NET arrays:

  • Pointer offsetting and pointer arithmetic
  • Reinterpretation as different types
  • Volatile data access for special use cases
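These operations can be sketched as follows in kernel code. Pointer offsetting with + and the Volatile() conversion appear later in this section; the Reinterpret member name is an assumption here:

    // Sketch of the pointer operations listed above (Reinterpret is assumed).
    let ptrOpsKernel =
        <@ fun (p:deviceptr<int>) (n:int) ->
            let tid = blockIdx.x * blockDim.x + threadIdx.x
            if tid < n then
                let q = p + tid                      // pointer offsetting
                let v = q.Volatile()                 // volatile view of the same element
                let f = p.Reinterpret<float32>()     // assumed API: view the bytes as float32
                v.Set(0, int f.[tid]) @>
        |> Compiler.makeKernel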

To access GPU device memory with a pointer, the memory is first allocated with Gpu.AllocateDevice() instead of Gpu.Allocate(). It returns a DeviceMemory object, which implements the IDisposable interface. Hence, the lifetime of the device memory can be conveniently controlled with the use keyword. The parallel transform using pointers instead of .NET arrays looks as follows:

let kernelPtr =
    <@ fun (result:deviceptr<int>) (arg1:deviceptr<int>) (arg2:deviceptr<int>) (length:int) ->
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < length do
            result.Set(i, arg1.[i] + arg2.[i])
            i <- i + stride @>
    |> Compiler.makeKernel

let runPtr() =
    let harg1 = Array.init length id
    let harg2 = Array.init length id
    let expected = (harg1, harg2) ||> Array.map2 (+)

    // allocate gpu memory
    let gpu = Gpu.Default
    use darg1 = gpu.AllocateDevice(harg1)
    use darg2 = gpu.AllocateDevice(harg2)
    use dresult = gpu.AllocateDevice<int>(length)

    // determine launching parameters and launch kernel
    let lp = LaunchParam(16, 256)
    gpu.Launch kernelPtr lp dresult.Ptr darg1.Ptr darg2.Ptr length
    // copy data back from gpu
    let actual = Gpu.CopyToHost(dresult)
    Assert.That(actual, Is.EqualTo(expected))

The code for this sample is in the sample gallery.


The type deviceptr<'T> is a struct value type. This means that it is not possible to directly write code like dresult.[i] <- ..., because the pointer value in F# is immutable. The proper way to set a value is to use deviceptr<'T>.Set(idx, value) or to rebind the pointer value as mutable.
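A minimal sketch of the two options just described, assuming dresult, arg1, arg2 are deviceptr<int> kernel parameters and i is a valid index:

    // Option 1: use the Set method on the immutable pointer value.
    dresult.Set(i, arg1.[i] + arg2.[i])

    // Option 2: rebind the pointer as mutable; the indexer setter then works.
    let mutable result = dresult
    result.[i] <- arg1.[i] + arg2.[i]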

In some cases a volatile pointer is required to prevent the compiler from performing register optimizations. A common use case is warp reduce with shared memory:

let reduce (op:'T -> 'T -> 'T) (input:'T) (warpId:int) =
    let shared = __shared__.ExternArray<'T>() |> __address_of_array
    let shared = (shared + warpId*(WarpSize + WarpSize/2)).Volatile()
    let laneId = threadIdx.x &&& 0x1f
    let mutable reduction = input
    for step = 0 to Steps - 1 do
        let offset = 1 <<< step
        shared.Set(laneId, reduction)
        let peer = shared.[laneId + offset]
        reduction <- op reduction peer


The shared memory is accessed through a pointer obtained from an array in shared memory with a call to __address_of_array. The code calculates the correct location for a warp with shared + warpId*(WarpSize + WarpSize/2) using pointer offsetting. Finally the pointer is converted into a volatile pointer with deviceptr<'T>.Volatile().

Device Functions

Most F# operators can be used in a GPU kernel; the appendix lists the supported F# operators. Moreover, all functions in the section on C# device functions are also available in F#. In addition, the auto-opened module Intrinsics provides the following functions, which can be called in GPU kernels in F# style:

Synchronization functions

  • __syncthreads
  • __syncthreads_count
  • __syncthreads_and
  • __syncthreads_or

Memory fence functions

  • __threadfence_block
  • __threadfence
  • __threadfence_system

Atomic functions

  • __atomic_add
  • __atomic_sub
  • __atomic_exch
  • __atomic_min
  • __atomic_max
  • __atomic_inc
  • __atomic_dec
  • __atomic_cas

Warp vote functions

  • __all
  • __any
  • __ballot

Warp lane mask properties

  • __lanemask_lt
  • __lanemask_le
  • __lanemask_gt
  • __lanemask_ge
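As an illustration, the following hedged sketch combines __atomic_add and __syncthreads in a simple histogram kernel. The argument convention assumed for __atomic_add — a pointer plus a value, returning the old value — is an assumption, as is the kernel itself; it is not part of the samples above:

    // Sketch: per-element atomic histogram update (assumed __atomic_add signature).
    let histogramKernel =
        <@ fun (hist:deviceptr<int>) (data:deviceptr<int>) (n:int) (bins:int) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                // atomically increment the bin this element falls into
                __atomic_add (hist + data.[i] % bins) 1 |> ignore
                i <- i + stride
            // wait until all threads of the block have finished their updates
            __syncthreads() @>
        |> Compiler.makeKernel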

Custom Types

It is possible to use struct types in GPU kernels in F# as in the following example:

[<Struct>]
type ShortSingle =
    val mutable x : int16
    val mutable y : float32

    new (x, y) = { x = x; y = y }

Constructors can be called inside quotations. Alternatively, a struct value can be created directly with the syntax { field = value; ... }.


It is not possible to add the ReflectedDefinition attribute to an instance method of a struct. The proper way is to create a static method with the ReflectedDefinition attribute, which takes the struct as an argument, either by value or by reference with the Ref<'T> type.
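A hedged sketch of this pattern for the ShortSingle struct above (the function names are illustrative):

    // Instead of an instance method, define a module-level function
    // with ReflectedDefinition that takes the struct by value ...
    [<ReflectedDefinition>]
    let sumFields (v:ShortSingle) : float32 = float32 v.x + v.y

    // ... or by reference with Ref<'T>, so mutations are visible to the caller.
    [<ReflectedDefinition>]
    let scaleY (v:Ref<ShortSingle>) (s:float32) : unit =
        let mutable tmp = v.Value
        tmp.y <- tmp.y * s
        v.Value <- tmp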

F# does not support the unsafe keyword to create blittable struct types that contain an embedded fixed-size array. Alea GPU provides some attributes to support embedded arrays:

[<Struct;StructLayout(LayoutKind.Explicit, Size=36)>]
type StatesAndIndex =
    [<FieldOffset(0);StructEmbeddedArrayField(8)>] val mutable State0 : int
    [<FieldOffset(32)>] val mutable Index : int

    member this.State
        with get (idx:int) : int = NativePtr.get &&this.State0 idx
        and set (idx:int) (value:int) : unit = NativePtr.set &&this.State0 idx value

    member this.States : int[] =
        [| this.State(0)
           this.State(7) |]

First, it is necessary to use an explicit layout to create room for the array members. Second, the attribute StructEmbeddedArrayField must be added to the field that is supposed to be the first element of the array. Third, either the StructEmbeddedArrayItem or StructEmbeddedArrayProperty attribute must be added to mark properties which access the array. If these properties are only used in GPU code, the CPU implementation can be skipped.

[<Struct;StructLayout(LayoutKind.Explicit, Size=36)>]
type StatesAndIndex =
    [<FieldOffset(0);StructEmbeddedArrayField(8)>] val mutable State0 : int
    [<FieldOffset(32)>] val mutable Index : int

    member this.State
        with get (idx:int) : int = failwith "device only"
        and set (idx:int) (value:int) : unit = failwith "device only"

    member this.States : int[] = failwith "device only"

The Align attribute is used to specify the memory alignment. Valid alignment values are 1, 2, 4, 8 and 16. Note that the alignment must be equal to or greater than the natural alignment of the type that is marked.

[<Struct;Align(8)>]
type IntVector2 =
    val mutable x : int
    val mutable y : int
    new (x, y) = { x = x; y = y }

Alea GPU also provides vector types such as the CUDA built-in vector types.

Printing in Kernels

The functions printfn and printf can be used in GPU kernels for type-safe printing. They can even be partially applied to obtain a function value which is then used subsequently in the kernel.

The supported argument types are:

  • bool
  • integers (int8/int16/int32/int64/uint8/uint16/uint32/uint64)
  • floating point number (single/double)
  • string
  • deviceptr

To format the arguments, Alea GPU supports a subset of the F# format specifiers:

  • bool : O A
  • integers : O A d i u o x X
  • floating point numbers : O A f F g G e E
  • string : O A s
  • deviceptr : O A

Here is an example how to use the print capabilities in a GPU kernel:

    let testBool = threadIdx.x % 2 = 0
    printfn "Hello World!"
    printfn "Pointer: %A %O" ptr ptr
    printfn "Boolean: %A %O" testBool (not testBool)
    printfn "Double : %f %F %e %E %.3e %.3E %g %G %A %O" d d d d d d d d d d
    printfn "Integer: %04d %i %u %x %X %o %A %O" i i i i i i i i
    let partial = printfn "blk(%03d) thd(%03d) %s" blockIdx.x threadIdx.x
    partial "string1"
    partial "string2"
    // outputs
    // Hello World!
    // Pointer: 0000000703640000 0000000703640000
    // Boolean: 1 0
    // Double : 4.200000 4.200000 4.200000e+000 4.200000E+000 4.200e+000 4.200E+000 4.2 4.2 4.2 4.2
    // Integer: -003 -3 4294967293 fffffffd FFFFFFFD 37777777775 -3 -3
    // blk(000) thd(000) string1
    // blk(000) thd(000) string2

The GPU device cannot directly output strings to the host's standard output device. Instead, it allocates a buffer in GPU device memory and writes the messages to the buffer. When certain events are triggered (see below), the CUDA driver gathers the messages from the GPU buffer and sends them to the host's standard output device.

The print buffer on the GPU device is a circular buffer with fixed size. If the buffer is full, old output will be overwritten. The size of this buffer can be read and written through the Gpu.PrintBufferSize property.

The events that trigger buffer flushing are:

  • the start of a kernel launching
  • synchronization (e.g. Gpu.Synchronize())
  • blocking memory copies (e.g. Gpu.Copy())
  • GPU module loading and unloading
  • context destruction (e.g. disposing a GPU instance)
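For instance, relying on the synchronization trigger listed above, the print buffer can be flushed explicitly right after a kernel that prints (printKernel is a hypothetical kernel that calls printfn):

    let gpu = Gpu.Default
    gpu.Launch printKernel (LaunchParam(1, 32))
    // Synchronize blocks until the kernel finishes and triggers the CUDA
    // driver to flush the GPU print buffer to the host's standard output.
    gpu.Synchronize()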

Note that printing actually happens in the CUDA driver, which is unmanaged code. This means that some applications, such as the output window of the Resharper unit test runner, might not receive the print output. A solution to this problem is to run the application from a Windows command line (or a Posix system's terminal). Alternatively, it is possible to use the SetStdHandle function to redirect the unmanaged standard output device to a file; more details can be found in the SetStdHandle documentation.