Overview

This section gives an overview of how to program GPUs with F#.

Note:

We use the term device synonymously with GPU and host with CPU to emphasize that a GPU is often a physically separate device operating as a coprocessor to the host.

GPU Programming with F#


GPU programming with F# relies on the CUDA Programming Model and F# code quotations, a language feature that allows F# code expressions to be generated and manipulated programmatically. Code quotations are created with the symbols <@ and @>, which delimit the quoted expression.

We explain the basic principles of F# GPU programming by means of a parallel transform with threads organized in a one-dimensional grid of blocks. Here is a schematic illustration of how such a grid of thread blocks processes the elements of an array in three iterations:

Transform

The actual GPU kernel is an F# code quotation of a function that takes two input arrays arg1 and arg2 and the array for the result.

let kernel = 
    <@ fun (result:int[]) (arg1:int[]) (arg2:int[]) ->
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < result.Length do
            result.[i] <- arg1.[i] + arg2.[i]
            i <- i + stride @>
    |> Compiler.makeKernel

The kernel is designed to be launched with a one-dimensional grid of one-dimensional thread blocks, so that the total number of threads working in parallel is gridDim.x*blockDim.x. Each thread running the kernel calculates the index of the first element it has to process and stores it in start. It then loops through the data with an increment of stride until no more elements need to be processed.
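To see that this grid-stride scheme covers every element exactly once, the same loop can be simulated sequentially on the CPU. The following sketch is plain F# with no GPU involved; the grid and block dimensions and the thread indices are modeled as ordinary integers:

```fsharp
// CPU-side sketch of the grid-stride loop: each (block, thread) pair
// visits the indices start, start + stride, start + 2*stride, ...
let gridDimX, blockDimX = 16, 256
let stride = gridDimX * blockDimX

let n = 10000
let visits = Array.zeroCreate<int> n

for blockIdxX in 0 .. gridDimX - 1 do
    for threadIdxX in 0 .. blockDimX - 1 do
        let start = blockIdxX * blockDimX + threadIdxX
        let mutable i = start
        while i < n do
            visits.[i] <- visits.[i] + 1
            i <- i + stride

// every element is processed exactly once
printfn "all visited once: %b" (visits |> Array.forall ((=) 1))
```

On the GPU the outer two loops run as parallel threads; the simulation only checks the index arithmetic.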

The code quotation representing the kernel is sent to Compiler.makeKernel. The result is an object of type KernelDef<int[] -> int[] -> int[] -> unit> representing a kernel definition. The type parameter of KernelDef is the signature of the kernel function in the quotation. Note that a kernel cannot return a value, hence the return type of a kernel is always unit.

The function in the code quotation shows how .NET arrays can be used in GPU kernels. Some array members such as Array.Length, Array.LongLength, Array.Rank, Array.GetLength(int dim) or Array.GetLongLength(int dim) can be called in the kernel. In case .NET arrays are not flexible enough, it is also possible to use the pointer type deviceptr<'T>, which facilitates pointer arithmetic and pointer reinterpretation.

It is worth noting that a thread in the CUDA Programming Model is an abstract parallel thread. You may not have gridDim.x*blockDim.x threads running concurrently. The actual number of threads running in parallel depends on the GPU hardware: the GPU schedules as many threads as possible to execute the kernel in parallel, so the more powerful the GPU, the more parallelism you gain.

The next step is to execute or launch the kernel. A GPU is a coprocessor with its own processing units and dedicated GPU memory. The process to execute a GPU kernel is:

  1. Allocate memory on the GPU and copy data from the host memory to the GPU memory.
  2. Launch the GPU kernel by providing the launch parameters, which define the number of threads per block and the grid size.
  3. Copy back the results from GPU memory to CPU memory.
  4. Release all GPU resources which are not used anymore.

Here is the code that launches the parallel transform GPU kernel.

[<Test>]
let run() =
    let harg1 = Array.init length id
    let harg2 = Array.init length id
    let expected = (harg1, harg2) ||> Array.map2 (+)

    let gpu = Gpu.Default
    let darg1 = gpu.Allocate(harg1)
    let darg2 = gpu.Allocate(harg2)
    let dresult = gpu.Allocate<int>(length)

    let lp = LaunchParam(16, 256)
    gpu.Launch kernel lp dresult darg1 darg2

    let actual = Gpu.CopyToHost(dresult)

    Gpu.Free(darg1)
    Gpu.Free(darg2)
    Gpu.Free(dresult)

    Assert.That(actual, Is.EqualTo(expected))

The static property Gpu.Default returns a suitable default GPU. More details about the selection algorithm are explained in the section on using multiple GPUs. The LaunchParam determines the number of threads in a block and the number of blocks in the grid. There are some guidelines for choosing suitable launch parameters: in essence, the number of threads in a block is usually determined by the occupancy of the kernel, and the number of blocks, i.e. the grid size, is determined by the size of the data to be processed. More details can be found in Maximize Utilization - Multiprocessor Level. In the parallel transform example we choose a fixed grid size of 16 blocks, each consisting of 256 threads. The kernel is launched on the selected GPU by calling the Gpu.Launch method, which has the signature

Gpu.Launch : KernelDef<'T> -> (LaunchParam -> 'T)

The code for this sample can be found in the sample gallery.

Note:

As of this version of Alea GPU, automatic memory management cannot be used with F#.

JIT Compilation

The section GPU Programming with F# explained how to create a kernel definition from a kernel function quotation. The function Compiler.makeKernel returns an object representing the kernel definition. Once a kernel definition is passed to the Gpu.Launch method, Alea GPU first checks whether this kernel definition has already been compiled and loaded on the GPU instance on which the launch method is called. If this is not the case, Alea GPU JIT compiles the kernel definition.

For this reason, repeated JIT compilations can be avoided by defining the kernel statically; for example, it is good practice to create kernel definition objects as global values. The following code demonstrates how to avoid unnecessary JIT compilation:

// create a simple kernel, with quotation through ReflectedDefinition attribute
[<ReflectedDefinition>]
let kernel (data:int[]) =
    data.[0] <- data.[0] + 1

[<Test>]
let jitCompileMultipleTimes() =
    let gpu = Gpu.Default
    let data = gpu.Allocate<int>(1)
    let lp = LaunchParam(1, 1)

    // launch the kernel for the first time
    // kernelDef is a new instance, a JIT compilation will happen
    Gpu.Copy([| 10 |], data)
    let kernelDef = <@ kernel @> |> Compiler.makeKernel
    gpu.Launch kernelDef lp data

    // launch the kernel for the second time
    // kernelDef is again a new instance, a JIT compilation will happen
    Gpu.Copy([| 20 |], data)
    let kernelDef = <@ kernel @> |> Compiler.makeKernel
    gpu.Launch kernelDef lp data

    // launch the kernel for the third time
    // kernelDef is again a new instance, a JIT compilation will happen
    Gpu.Copy([| 30 |], data)
    gpu.Launch (<@ kernel @> |> Compiler.makeKernel) lp data

    gpu.Synchronize()
    Gpu.Free(data)

// create kernel definition object as a global value
let kernelDef = <@ kernel @> |> Compiler.makeKernel

[<Test>]
let jitCompileCached() =
    let gpu = Gpu.Default
    let data = gpu.Allocate<int>(1)
    let lp = LaunchParam(1, 1)

    // launch the kernel for the first time
    // kernelDef hasn't been compiled and loaded before on this GPU,
    // a JIT compilation will happen
    Gpu.Copy([| 10 |], data)
    gpu.Launch kernelDef lp data

    // launch the kernel for the second time
    // kernelDef is compiled and loaded already,
    // thus no JIT compilation will happen
    Gpu.Copy([| 20 |], data)
    gpu.Launch kernelDef lp data

    // launch the kernel for the third time
    // kernelDef is compiled and loaded already,
    // thus no JIT compilation will happen
    Gpu.Copy([| 30 |], data)
    gpu.Launch kernelDef lp data

    gpu.Synchronize()
    Gpu.Free(data)

Generics


Generics can also be used in GPU kernels. We illustrate this with a generic transform kernel.

let kernel (op:Expr<'T -> 'T -> 'T>) =
    <@ fun (result:'T[]) (arg1:'T[]) (arg2:'T[]) ->
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < result.Length do
            result.[i] <- (%op) arg1.[i] arg2.[i]
            i <- i + stride @>

The function kernel takes the transform operation expression as an argument and splices it into the transform function quotation with the F# quotation splicing operator %.
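As a small, CPU-only illustration of splicing (independent of Alea GPU), the following plain F# snippet splices a typed operator expression into a larger quotation, exactly as the kernel factory above does:

```fsharp
open Microsoft.FSharp.Quotations

// an operator captured as a typed expression, as in the kernel factory above
let op : Expr<int -> int -> int> = <@ (+) @>

// (%op) splices the expression for (+) into the surrounding quotation,
// yielding the same tree as writing <@ fun a b -> (+) a b @> directly
let combined : Expr<int -> int -> int> = <@ fun a b -> (%op) a b @>

printfn "%A" combined
```

Because the splice happens when the quotation is built, each choice of op produces a distinct, fully concrete expression tree to compile.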

As explained in the section JIT Compilation, we should create kernel definition objects globally:

let kernelAddI32 : KernelDef<int[] -> int[] -> int[] -> unit> = 
    kernel <@ (+) @> |> Compiler.makeKernel

let kernelSubF64 : KernelDef<float[] -> float[] -> float[] -> unit> =
    kernel <@ (-) @> |> Compiler.makeKernel

Note that we provide the signature of the kernel definition explicitly so that the desired overloading of the generic operators + and - is used.

A common problem with .NET generics is that arithmetic operations on generic data types require an implementation that has to be provided explicitly. F# inline functions solve this problem more elegantly. The following code illustrates how to use this technique to implement kernels with arithmetic operations on generic data types:

[<ReflectedDefinition>]
let inline kernelCustom (result:'T[]) (arg1:'T[]) (arg2:'T[]) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let mutable i = start
    while i < result.Length do
        result.[i] <- 2G * (arg1.[i] + arg2.[i] - __gconv 0.5)
        i <- i + stride

let kernelCustomF32 : KernelDef<float32[] -> float32[] -> float32[] -> unit> = 
    <@ kernelCustom @> |> Compiler.makeKernel

let kernelCustomF64 : KernelDef<float[] -> float[] -> float[] -> unit> =
    <@ kernelCustom @> |> Compiler.makeKernel

Because F# is strongly typed, we have to apply some additional type conversions. For example, the symbol 2G is an F# generic literal representing the integer value 2. The function __gconv : 'a -> 'b provided by Alea GPU is another convenience function for type conversion.
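The same inline mechanism can be tried outside a kernel. The sketch below is plain F# with no GPU, and it uses LanguagePrimitives.GenericOne from FSharp.Core rather than Alea GPU's helpers; it shows how an inline function is re-resolved for each numeric type at the call site:

```fsharp
// an inline function is re-resolved at each use site, so the arithmetic
// below works for int, float, float32, ... without explicit dictionaries
let inline doubleSum (a:'T) (b:'T) : 'T =
    let two = LanguagePrimitives.GenericOne + LanguagePrimitives.GenericOne
    two * (a + b)

printfn "%d" (doubleSum 3 4)      // int: 14
printfn "%g" (doubleSum 1.5 2.0)  // float: 7
```

This is the design that makes kernelCustom above compile once per concrete element type when a typed KernelDef signature is given.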

The full code for this sample can be found in the sample gallery.

Debugging


Debugging GPU kernels written in F# with the NVIDIA Nsight debugger is similar to debugging C# GPU code.

The first step to enable debugging is to set the --quotations-debug flag in F# compilation settings:

Config Quotations Debug

Second, as for C#, set the JIT compilation level to Diagnostic in the App.config file:

<configuration>
    <configSections>
        <section name="aleaSettings" type="Alea.Settings, Alea"/>
    </configSections>
    <aleaSettings>
        <jitCompile level="Diagnostic"/>
    </aleaSettings>
</configuration>

To start debugging, set a breakpoint and then choose Start CUDA Debugging:

Start Debug

Data and variables can be inspected in a warp watch window, which displays the values of variables and memory locations for all threads of a warp. This gives a good overview of the operations of the threads in a warp. The menu item Nsight > Windows > CUDA Warp Watch opens a warp watch window. It can also be used to display .NET arrays: for instance, array->Pointer[idx] displays the element at index idx and array->Lengths[n] retrieves the length of dimension n.

Debug with Warp Watch