Appendix

Implementation Restrictions


The current version has the following limitations:

  • Constructing a structure with a non-default constructor is not yet supported.

Supported Types


Native Functions


Supported Math Functions


The following functions are generic overloads of the corresponding NVIDIA LibDevice math functions:

  • DeviceFunction.Abs
  • DeviceFunction.Acos
  • DeviceFunction.Asin
  • DeviceFunction.Atan
  • DeviceFunction.Atan2
  • DeviceFunction.Ceiling
  • DeviceFunction.Cos
  • DeviceFunction.Cosh
  • DeviceFunction.Exp
  • DeviceFunction.Floor
  • DeviceFunction.Log
  • DeviceFunction.Log10
  • DeviceFunction.Max
  • DeviceFunction.Min
  • DeviceFunction.Pow
  • DeviceFunction.Sin
  • DeviceFunction.Sinh
  • DeviceFunction.Sqrt
  • DeviceFunction.Tan
  • DeviceFunction.Tanh
  • DeviceFunction.Truncate

The following functions from System.Math are supported in a GPU kernel and mapped to the corresponding NVIDIA LibDevice function:

  • Math.Round(single/double)
  • Math.Sin(single/double)
  • Math.Cos(single/double)
  • Math.Log(single/double)
  • Math.Exp(single/double)
  • Math.Sqrt(single/double)

Supported F# Operators


The following F# operators are supported within a kernel:
Function | Description
unary plus | e.g., +1, +2.0
unary negation | e.g., -1, -2.0
logical not | (~)
not | boolean negation
int8 | convert to int8
int16 | convert to int16
int | convert to int
int64 | convert to int64
uint8 | convert to uint8
uint16 | convert to uint16
uint32 | convert to uint32
uint64 | convert to uint64
float32 | convert to float32
float | convert to float
enum | convert to enum
abs | absolute value function
floor | floor function
ceil | ceiling function
truncate | truncation function
sqrt | square root function
exp | exponential
log | natural log
log10 | base 10 logarithm
sin | sine
sinh | hyperbolic sine
asin | arcsine, inverse sine
cos | cosine
cosh | hyperbolic cosine
acos | arccosine, inverse cosine
tan | tangent
tanh | hyperbolic tangent
atan | arctangent, inverse tangent
+ | addition
- | subtraction
* | multiplication
/ | division
% | modulus or quotation splicing
= | returns true if the left side is equal to the right side
<> | returns true if the left side is not equal to the right side
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
<<< | left shift; x <<< y shifts x left by y bits
>>> | right shift; x >>> y shifts x right by y bits
&&& | bitwise AND
||| | bitwise OR
^^^ | exclusive OR
min | min(x,y) returns the minimum of x and y
max | max(x,y) returns the maximum of x and y
pow | pow(x,y) raises x to the yth power
|> | forward pipe
||> | two-argument forward pipe
|||> | three-argument forward pipe
ref | used to create reference cells
! | dereferences a reference cell; after a keyword, indicates a modified version of the keyword's behavior as controlled by a workflow

The CUDA Programming Model


NVIDIA introduced the CUDA parallel programming model in late 2006 to facilitate the programming of GPUs. A detailed explanation can be found in the CUDA Programming Guide. The CUDA programming model relies on several key concepts:

  • Kernel functions executed in parallel by multiple threads
  • A thread hierarchy of thread blocks and a grid of thread blocks
  • Shared memory assigned to thread blocks
  • Barrier synchronization

The thread hierarchy provides two levels of parallelism:

  • Fine-grained data parallelism and thread parallelism for the threads of a block, with synchronization and data exchange between threads
  • Coarse-grained data parallelism and task parallelism for blocks of threads executing independently

This makes it possible to partition large computations into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into smaller pieces that can be solved cooperatively in parallel by all threads within a block, exploiting fine-grained parallelism, data exchange, and synchronization.

The CUDA programming model enables scalability: each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a CUDA program can execute on any number of multiprocessors.

In the CUDA parallel programming model, an application consists of a sequential host program and multiple parallel programs called kernels. A kernel is a single function or program that is executed N times by N threads in parallel.

The thread that executes a kernel is identified by a thread index. Similarly, the thread block of the thread has a block index within the grid of blocks. A kernel can interrogate its thread and block indices through the built-in variables threadIdx and blockIdx. The dimensions of the block and the grid can be queried through the variables blockDim and gridDim, respectively.

Built-in Variable | Description
threadIdx.x, threadIdx.y, threadIdx.z | thread index within the thread block
blockDim.x, blockDim.y, blockDim.z | dimension of the thread block
blockIdx.x, blockIdx.y, blockIdx.z | block index within the grid
gridDim.x, gridDim.y, gridDim.z | dimension of the grid

The three-dimensional index scheme provides more flexibility in mapping a computational problem to parallel threads.

A kernel is executed by multiple thread blocks of identical dimension. Consequently, the total number of threads is the number of blocks times the number of threads per block. There is a limit on the number of threads per block, which depends on the compute capability of the device.
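
As a minimal illustration in CUDA C (a sketch, not part of any particular library API; the kernel and all names are invented for this example), the following kernel derives a global thread index from blockIdx, blockDim and threadIdx, and the host code chooses the number of blocks from the problem size:

    #include <cuda_runtime.h>

    // Each thread processes one element, identified by its global index.
    __global__ void addKernel(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // the grid may contain more threads than elements
            c[i] = a[i] + b[i];
    }

    void launchAdd(const float* dA, const float* dB, float* dC, int n)
    {
        int threadsPerBlock = 256;    // must not exceed the device limit
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
        addKernel<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);
    }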

Thread blocks execute independently of each other, without any given ordering, sequentially or in parallel. On the other hand, threads within a block can cooperate by sharing data through shared memory and can be synchronized.

The CUDA execution model assigns thread blocks to streaming multiprocessors. During execution the threads of a block are grouped into warps. Multiprocessors on the GPU then execute instructions for each warp in a single instruction multiple data (SIMD) fashion. The warp size or the SIMD width of all current CUDA-capable GPUs is 32 threads.

Term | Explanation
host | host machine to which the GPU device is connected as a coprocessor
device | synonym for a CUDA-enabled GPU
kernel | function executed in parallel by multiple threads
thread | lightweight process on the device executing a kernel
thread block | set of threads that can be synchronized and can exchange data with each other
warp | minimal scheduling unit; a group of 32 threads executing a kernel in a single instruction multiple data fashion
grid | set of blocks executing a single kernel
device memory | off-chip memory that is accessible to all threads and to the host
global memory | synonym for device memory
coalesced memory access | multiple global memory accesses that are combined by the device into a single memory transaction; requires appropriate memory access alignment and contiguity
streaming multiprocessor (SM) | multiprocessor with an array of scalar processors, shared memory, registers, and caches for texture and constant memory; thread blocks are scheduled to execute on SMs
scalar processor (SP) | one of the scalar processors in an SM; the SPs of an SM execute a warp in a SIMD fashion
occupancy | measure of effectiveness for a kernel, defined as the ratio of active warps per SM to the maximum number of warps per SM

Compute Capability


The general specifications and features of a CUDA GPU device depend on its compute capability. The specifications and features of each major compute capability version are documented in the CUDA Programming Guide.

GPU Memory Types


There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior.

[Figure: GPU Memory Types]

Registers: Registers are the fastest memory. In most cases a register access adds no extra clock cycles to an instruction, although delays can occur due to read-after-write dependencies and bank conflicts. Registers are allocated to individual threads; each thread can access only its own registers. A kernel relies on registers to hold frequently accessed variables that are private to each thread.

Shared memory: Shared memory is comparable to L1 cache memory on a regular CPU. It is on-chip memory close to the multiprocessor, and has very short access times and high bandwidth. Shared memory is shared among all the threads of a given block. If the shared memory can be accessed without bank conflicts, its performance is comparable to register memory.

Global memory: Global memory resides in device DRAM and is used for transfers between the host and device. The name global here refers to scope, as it can be accessed and modified from both the host and the device. All threads in a kernel have access to all data in global memory. Global memory allocations can persist for the lifetime of the application. Depending on the compute capability of the device, global memory may or may not be cached on the chip.
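
For example, a typical allocation and transfer pattern with the CUDA runtime API might look as follows (a sketch; the element count and names are arbitrary):

    #include <cuda_runtime.h>

    void copyToDevice(const float* hostData, size_t n)
    {
        float* deviceData = nullptr;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void**)&deviceData, bytes);        // allocate global (device) memory
        cudaMemcpy(deviceData, hostData, bytes,
                   cudaMemcpyHostToDevice);            // transfer host -> device
        // ... launch kernels that read and write deviceData ...
        cudaFree(deviceData);                          // release the allocation
    }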

Local memory: Each thread has private local memory, which is physically stored in global memory. Variables are placed in thread-local memory when not enough registers are available. This memory is slow, even though it is called local.

Constant memory: 64 KB of constant memory resides off-chip. The host code writes to constant memory before launching the kernel, and the kernel may then read this memory. Each SM can cache up to 8 KB of constant memory, so that subsequent reads from constant memory can be very fast. All threads have access to constant memory.
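
The usual pattern, sketched here in CUDA C (the symbol name filterCoeffs and its size are invented for the example): the host writes a __constant__ array with cudaMemcpyToSymbol before the launch, and every thread reads it inside the kernel:

    #include <cuda_runtime.h>

    __constant__ float filterCoeffs[16];        // lives in the 64 KB constant memory space

    __global__ void applyFilter(const float* input, float* output, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            output[i] = input[i] * filterCoeffs[i % 16];   // reads are served from the constant cache
    }

    void setCoefficients(const float* hostCoeffs)
    {
        // The host writes constant memory before launching the kernel.
        cudaMemcpyToSymbol(filterCoeffs, hostCoeffs, 16 * sizeof(float));
    }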

Texture memory: Texture memory is specialized memory for surface texture mapping but can also be used for read-only data. Texture memory has additional addressing modes and provides data filtering and interpolation.

Memory | Location | Cached | Access | Scope | Lifetime
Register | on chip | n/a | r/w | one thread | thread
Local | off chip | yes (sm20) | r/w | one thread | thread
Shared | on chip | n/a | r/w | all threads in block | block
Device | off chip | yes | r/w | all threads and host | host allocation
Constant | off chip | yes | r | all threads and host | host allocation
Texture | off chip | yes | r | all threads and host | host allocation

Global Device Memory

Grouping threads into warps is relevant not only to computation but also to global memory accesses. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize the consumed DRAM bandwidth, provided certain access requirements are met. These access requirements for coalescing depend on the compute capability of the device.

The CUDA Best Practices Guide explains memory coalescing concepts and illustrates them with examples.
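
To make the idea concrete, here is a sketch (not taken from the guide; names are illustrative) contrasting an access pattern that can be coalesced with one that cannot:

    // Coalesced: consecutive threads of a warp read consecutive 32-bit words,
    // so the warp touches one contiguous, aligned segment of global memory.
    __global__ void coalescedRead(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads read elements that are `stride` apart,
    // so the accesses of a warp are spread over many memory segments.
    __global__ void stridedRead(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }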

Shared Memory

Shared memory is one of the key components of a GPU. Physically, each streaming multiprocessor (SM) has a small on-chip, low-latency memory pool that is shared among all threads of the thread blocks currently executing on that SM. Its latency is about 100x lower than uncached global memory latency, and its bandwidth is about 10 times higher. Shared memory allows threads within the same thread block to cooperate, exchange data, and reuse on-chip data, which can significantly reduce the global memory bandwidth requirements of a kernel.

A fixed amount of shared memory is allocated to each thread block when it starts. Its contents have the same lifetime as the thread block and are shared by all threads in the block. Shared memory is partitioned among all resident thread blocks on an SM; therefore, the more shared memory a kernel uses, the fewer thread blocks can be active at the same time on an SM. Shared memory is thus a critical resource that limits device parallelism.

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory banks that can be accessed simultaneously. Shared memory access requests are issued per warp. If all threads of a warp access different memory banks, the request can be serviced in a single transaction. However, if multiple threads access memory in the same bank, the accesses are serialized, which decreases the effective bandwidth. In the worst case a request is serialized into 32 transactions. To reduce bank conflicts, it is important to understand how addresses map to memory banks, which depends on the compute capability. All GPU devices of compute capability 2 or newer have 32 shared memory banks, with successive 32-bit words assigned to successive banks. Compute capability 3 devices additionally have a 64-bit mode in which successive 64-bit words map to successive banks. Bank conflicts do not occur if two threads of a warp access any address within the same 32-bit word (64-bit word in the 64-bit mode of compute capability 3): for read accesses the word is broadcast to the requesting threads, and for write accesses it is written by only one of the threads.

More details can be found in the NVIDIA programming documentation.
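
As an illustration of these rules, the following sketch (a standard tiled matrix transpose, assuming 32 x 32 thread blocks; not specific to any library) pads the shared memory tile by one element so that column-wise accesses fall into different banks:

    #define TILE_DIM 32   // launch with blockDim = dim3(TILE_DIM, TILE_DIM)

    // The extra column (+1) offsets successive rows into different banks,
    // so the column-wise reads in the second phase avoid 32-way bank conflicts.
    __global__ void transposeTiled(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();                           // the whole tile must be loaded first

        x = blockIdx.y * TILE_DIM + threadIdx.x;   // transposed block coordinates
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }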

Pinned Memory

Host memory is pageable by default. The GPU cannot access data directly from pageable host memory. When a data transfer from pageable host memory to device memory is invoked, the CUDA driver first allocates a temporary page-locked, or pinned, host buffer. It then copies the host data into this buffer and transfers from this pinned memory to device memory:

[Figure: Copy from Pageable/Pinned Memory]

Allocating the data on the host side directly in pinned memory avoids the cost of the transfer from pageable to pinned host memory.
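
A minimal sketch with the CUDA runtime API (names, sizes and the stream are illustrative): cudaMallocHost allocates the host buffer directly in pinned memory, which also allows the copy to be issued asynchronously:

    #include <cuda_runtime.h>

    void pinnedTransfer(size_t n, cudaStream_t stream)
    {
        size_t bytes = n * sizeof(float);
        float* hostBuf = nullptr;
        float* devBuf = nullptr;

        cudaMallocHost((void**)&hostBuf, bytes);   // pinned (page-locked) host memory
        cudaMalloc((void**)&devBuf, bytes);

        // ... fill hostBuf ...

        // Asynchronous copy; it can overlap with host work because the source is pinned.
        cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        cudaFree(devBuf);
        cudaFreeHost(hostBuf);                     // release the pinned allocation
    }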

Recommendation:

Do not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system.

Unified Memory Support

CUDA unified memory creates a pool of managed memory that is shared between the CPU and GPU and is accessible to both the CPU and GPU using a single pointer. The CUDA runtime system automatically migrates data allocated in unified memory between host and device at the page level, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU. Further details are in the CUDA Programming Guide.
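
For example (a sketch; the array size and kernel are invented), cudaMallocManaged returns a single pointer that both host code and device kernels use directly:

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    void scaleManaged(int n)
    {
        float* data = nullptr;
        cudaMallocManaged((void**)&data, n * sizeof(float));   // one pointer, visible to CPU and GPU

        for (int i = 0; i < n; i++) data[i] = (float)i;        // CPU writes through the same pointer

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);        // GPU access; pages migrate on demand
        cudaDeviceSynchronize();                               // wait before the CPU reads the results

        cudaFree(data);
    }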

CUDA Streams


The CUDA programming model manages concurrency by executing asynchronous commands in streams. A stream represents a sequence of commands that execute in order; different streams may execute their commands concurrently or out of order with respect to each other. This is useful for heterogeneous computing, where applications want to execute functions concurrently on all processors of the system, including CPUs and GPUs.

GPU devices of compute capability 2 and higher can:

  • Execute multiple kernels concurrently
  • Perform an asynchronous memory copy to or from the GPU concurrently with kernel execution
  • Overlap memory copy operations to and from the device
  • Perform an intra-device copy simultaneously with kernel execution

These concurrent tasks are supported through asynchronous functions that return control to the host thread before the device completes the requested task. In this way, many device operations can be queued up and executed when the appropriate device resources become available.
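
As an illustration (a sketch; the number of streams, the chunk size and the kernel are invented, and the host buffer is assumed to be pinned), two streams each handle their own slice of the data, so the copies of one slice can overlap with the kernel of the other:

    #include <cuda_runtime.h>

    __global__ void process(float* data, int n);   // some kernel, defined elsewhere

    // hostBuf must be pinned memory for the asynchronous copies to overlap;
    // n is assumed to be divisible by the number of streams.
    void runInStreams(float* hostBuf, float* devBuf, int n)
    {
        const int numStreams = 2;
        int chunk = n / numStreams;
        size_t bytes = chunk * sizeof(float);

        cudaStream_t streams[numStreams];
        for (int s = 0; s < numStreams; s++)
            cudaStreamCreate(&streams[s]);

        for (int s = 0; s < numStreams; s++) {
            int offset = s * chunk;
            // Commands in the same stream execute in order;
            // commands in different streams may overlap.
            cudaMemcpyAsync(devBuf + offset, hostBuf + offset, bytes,
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(devBuf + offset, chunk);
            cudaMemcpyAsync(hostBuf + offset, devBuf + offset, bytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }

        cudaDeviceSynchronize();                   // wait for all streams to finish
        for (int s = 0; s < numStreams; s++)
            cudaStreamDestroy(streams[s]);
    }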