- Implementation Restrictions
- Supported Types
- Native Functions
- Supported Math Functions
- Supported F# Operators
- The CUDA Programming Model
- Compute Capability
- GPU Memory Types
- CUDA Streams
The current version has the following limitations:
- Constructing a structure with a non-default constructor is not yet supported.
Supported Math Functions
The following functions are generic overloads of the corresponding NVIDIA LibDevice math functions:
The following functions from System.Math are supported in a GPU kernel and mapped to the corresponding NVIDIA LibDevice function:
Supported F# Operators
The following F# operators are supported within a kernel:
|Operator||Description|
|`+` (unary)||e.g., +1, +2.0|
|`-` (unary)||e.g., -1, -2.0|
|`sbyte`||convert to int8|
|`int16`||convert to int16|
|`int`||convert to int|
|`int64`||convert to int64|
|`byte`||convert to uint8|
|`uint16`||convert to uint16|
|`uint32`||convert to uint32|
|`uint64`||convert to uint64|
|`float32`||convert to float32|
|`float`||convert to float|
|`enum`||convert to enum|
|`abs`||absolute value function|
|`sqrt`||square root function|
|`log10`||base 10 logarithm|
|`asin`||arcsine, inverse sine|
|`acos`||arccosine, inverse cosine|
|`atan`||arctangent, inverse tangent|
|`%`||modulus or quotation splicing|
|`=`||returns true if the left side is equal to the right side|
|`<>`||returns true if the left side is not equal to the right side|
|`<=`||less than or equal to|
|`>=`||greater than or equal to|
|`<<<`||left shift|
|`>>>`||right shift|
|`||>`||2 argument forward pipe|
|`|||>`||3 argument forward pipe|
|`ref`||used to create reference cells|
|`!`||dereferences a reference cell; after a keyword, indicates a modified version of the keyword's behavior as controlled by a workflow|
The CUDA Programming Model
NVIDIA introduced the CUDA parallel programming model in late 2006 to facilitate the programming of GPUs. A detailed explanation can be found in the CUDA Programming Guide. The CUDA programming model relies on several key concepts:
- Kernel functions executed in parallel by multiple threads
- A thread hierarchy of thread blocks and a grid of thread blocks
- Shared memory assigned to thread blocks
- Barrier synchronization
The thread hierarchy provides two levels of parallelism:
- Fine-grained data parallelism and thread parallelism for the threads of a block, with synchronization and data exchange between threads
- Coarse-grained data parallelism and task parallelism for blocks of threads executing independently
This allows a large computation to be partitioned into coarse sub-problems that are solved independently in parallel by blocks of threads, while each sub-problem is broken into smaller pieces that are solved cooperatively in parallel by all threads within a block, exploiting fine-grained parallelism, data exchange and synchronization.
The CUDA programming model enables scalability: each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a CUDA program can execute on any number of multiprocessors.
In the CUDA parallel programming model, an application consists of a sequential host program and multiple parallel programs called kernels. A kernel is a single function or program that is executed N times by N threads in parallel.
The thread that executes a kernel is identified by its thread index. Similarly, the thread's block has a block index within the grid of blocks. A kernel can interrogate its thread and block indices through the built-in variables threadIdx and blockIdx. The dimensions of the block and the grid can be queried from the variables blockDim and gridDim:
|Built-in Variable||Description|
|threadIdx||thread index within the thread block|
|blockDim||dimension of the thread block|
|blockIdx||block index within the grid|
|gridDim||dimension of the grid|
The three-dimensional index scheme provides flexibility in mapping a computational problem to parallel threads.
A kernel is executed by multiple thread blocks of identical dimension. Consequently, the total number of threads is the number of blocks times the number of threads per block. The maximum number of threads per block depends on the compute capability of the device.
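As a minimal sketch in plain CUDA C (the concepts carry over to kernels written with this library; the function names are illustrative), the built-in variables combine into a global thread index, and the host rounds the grid size up so the whole input is covered:

```cuda
// Each thread scales one element of `a`.
__global__ void scaleKernel(float *a, float s, int n)
{
    // Global index: block offset plus thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may contain more threads than elements
        a[i] = s * a[i];
}

void launchScale(float *d_a, float s, int n)
{
    int threadsPerBlock = 256;   // must not exceed the device's per-block limit
    // Round up: enough blocks to cover all n elements.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_a, s, n);
}
```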
Thread blocks execute independently of each other, without any given ordering, sequentially or in parallel. On the other hand, threads within a block can cooperate by sharing data through shared memory and can be synchronized.
The CUDA execution model assigns thread blocks to streaming multiprocessors. During execution the threads of a block are grouped into warps. Multiprocessors on the GPU then execute instructions for each warp in a single instruction multiple data (SIMD) fashion. The warp size or the SIMD width of all current CUDA-capable GPUs is 32 threads.
|Term||Description|
|host||host machine to which the GPU device is connected as a coprocessor|
|device||synonym for a CUDA-enabled GPU|
|kernel||function executed in parallel by multiple threads|
|thread||lightweight process on the device executing a kernel|
|thread block||set of threads which can be synchronized and exchange data with each other|
|warp||minimal scheduling unit; a group of 32 threads executing a kernel in a single instruction multiple data fashion|
|grid||set of blocks executing a single kernel|
|device memory||off-chip memory that is accessible to all threads and to the host|
|global memory||synonym for device memory|
|coalesced memory access||multiple global memory accesses that are combined by the device into a single memory transaction; requires appropriate memory access alignment and contiguity|
|streaming multiprocessor (SM)||multiprocessor with an array of scalar processors, shared memory, registers, and caches for texture and constant memory; thread blocks are scheduled to execute on SMs|
|scalar processor (SP)||scalar processor within an SM; SPs execute the threads of a warp in a SIMD fashion|
|occupancy||measure of effectiveness of a kernel, defined as the ratio of active warps per SM to the maximum number of warps per SM|
The general specifications and features of a CUDA GPU device depend on its compute capability. They are documented in the CUDA Programming Guide; each major version corresponds to a GPU architecture generation.
GPU Memory Types
There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior.
Registers: Registers are the fastest memory. In most cases a register access adds no extra clock cycles to an instruction. Delays can occur due to read-after-write dependencies and register bank conflicts. Registers are allocated to individual threads, and each thread can access only its own registers. A kernel relies on registers to hold frequently accessed variables that are private to each thread.
Shared memory: Shared memory is comparable to L1 cache memory on a regular CPU. It is on-chip memory close to the multiprocessor, and has very short access times and high bandwidth. Shared memory is shared among all the threads of a given block. If the shared memory can be accessed without bank conflicts, its performance is comparable to register memory.
Global memory: Global memory resides in device DRAM and is used for transfers between the host and device. The name global here refers to scope, as it can be accessed and modified from both the host and the device. All threads in a kernel have access to all data in global memory. Global memory allocations can persist for the lifetime of the application. Depending on the compute capability of the device, global memory may or may not be cached on the chip.
Local Memory: Each thread has private local memory that resides in global memory. Variables are placed in a thread's local memory when not enough registers are available. Despite its name, local memory is slow.
Constant Memory: 64 KB of constant memory resides off-chip. The host code writes to constant memory before launching the kernel, and the kernel may then read this memory. Each SM can cache up to 8 KB constant memory, so that subsequent reads from constant memory can be very fast. All threads have access to constant memory.
Texture Memory: Texture memory is specialized memory for surface texture mapping but can also be used for read only data. Texture memory has additional addressing modes and provides data filtering and interpolation.
|Memory||Location||Cached||Access||Scope||Lifetime|
|Register||on chip||n/a||r/w||one thread||thread|
|Local||off chip||yes (sm20)||r/w||one thread||thread|
|Shared||on chip||n/a||r/w||all threads in block||block|
|Device||off chip||yes||r/w||all threads and host||host allocation|
|Constant||off chip||yes||r||all threads and host||host allocation|
|Texture||off chip||yes||r||all threads and host||host allocation|
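A short sketch in plain CUDA C of how these memory spaces appear in a kernel (the names are illustrative, and the grid is assumed to exactly cover the input):

```cuda
__constant__ float coeff;     // constant memory: written by the host before launch

// `in` and `out` point to global (device) memory.
__global__ void memorySpaces(const float *in, float *out)
{
    __shared__ float tile[256];               // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                          // x lives in a register, private to the thread
    tile[threadIdx.x] = x;                    // exchange data within the block
    __syncthreads();                          // barrier before reading other threads' values
    out[i] = coeff * tile[threadIdx.x];
}

// Host side: constant memory is written with
//   cudaMemcpyToSymbol(coeff, &h_coeff, sizeof(float));
```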
Global Device Memory
Grouping of threads into warps is not only relevant to computation, but also to global memory accesses. A GPU device coalesces multiple global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth, provided certain access requirements are met. The access requirements for coalescing depend on the compute capability of the device.
The CUDA Best Practices Guide explains memory coalescing concepts and illustrates them with examples.
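The contrast can be sketched in plain CUDA C (illustrative kernels, grids assumed to cover the data):

```cuda
// Coalesced: consecutive threads of a warp read consecutive 4-byte words,
// so the hardware merges the warp's loads into few transactions.
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart;
// for large strides each thread's access falls into a different memory
// segment and the warp's request splits into many transactions.
__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```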
Shared memory is one of the key components of a GPU. Physically each streaming multiprocessor (SM) has a small on-chip low latency memory pool that is shared among all threads of a thread block currently executing on that SM. Its latency is about 100x lower than uncached global memory latency and bandwidth is about 10 times higher. Shared memory allows threads within the same thread block to cooperate, exchange data and reuse on-chip data, which can significantly reduce the global memory bandwidth requirements of a kernel.
A fixed amount of shared memory is allocated to each thread block at startup. Its content has the same lifetime as the thread block. It is shared by all threads in the thread block. Shared memory is partitioned among all resident thread blocks on an SM. Therefore, the more shared memory is used by a kernel the fewer thread blocks can be active at the same time on an SM. Hence, shared memory is a critical resource that limits device parallelism.
To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory banks which can be accessed simultaneously. Shared memory access requests are issued per warp. If all threads of a warp access different memory banks, the request can be serviced in a single transaction. However, if multiple threads access addresses on the same bank, the accesses are serialized, which decreases the effective bandwidth; in the worst case a request is serialized into 32 transactions. To reduce bank conflicts it is important to understand how addresses map to memory banks, which depends on the compute capability. All GPU devices of compute capability 2 or newer have 32 shared memory banks, with successive 32-bit words assigned to successive banks. Compute capability 3 additionally has a 64-bit mode in which successive 64-bit words map to successive banks. Bank conflicts do not occur if two threads of a warp access any address within the same 32-bit word (64-bit word in the 64-bit mode of compute capability 3): for read accesses the word is broadcast to the requesting threads, and for write accesses it is written by only one thread.
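A classic illustration of avoiding bank conflicts is the shared-memory matrix transpose, sketched here in plain CUDA C (assumes a 32x32 thread block and a matrix dimension divisible by 32):

```cuda
#define TILE 32

__global__ void transposeTile(const float *in, float *out, int n)
{
    // Padding each row by one element shifts the elements of a column into
    // different banks; a plain [TILE][TILE] array would put a whole column
    // into one bank and serialize a warp's column read into 32 transactions.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced row read
    __syncthreads();                                    // tile fully populated

    x = blockIdx.y * TILE + threadIdx.x;                // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free column read
}
```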
More details are in the NVIDIA programming documentation.
Host memory is pageable by default. The GPU cannot access data directly from pageable host memory. When a data transfer from pageable host memory to device memory is invoked, the CUDA driver first allocates a temporary page-locked, or pinned, host buffer. It then copies the host data into this buffer and transfers from this pinned memory to device memory:
Allocating the data on the host side directly in pinned memory avoids the cost of the transfer from pageable to pinned host memory.
Do not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system.
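In plain CUDA C, a pinned allocation and an asynchronous transfer look roughly like this (a sketch with an arbitrary buffer size; error checking omitted):

```cuda
size_t bytes = 1 << 20;
float *h_pinned, *d_buf;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMallocHost((void **)&h_pinned, bytes);   // page-locked (pinned) host memory
cudaMalloc((void **)&d_buf, bytes);

// The copy goes straight from pinned host memory to the device, skipping the
// driver's intermediate staging buffer, and can proceed asynchronously with
// respect to the host thread.
cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

cudaFreeHost(h_pinned);   // release pinned memory promptly
cudaFree(d_buf);
cudaStreamDestroy(stream);
```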
Unified Memory Support
CUDA unified memory creates a pool of managed memory that is shared between the CPU and GPU and is accessible to both the CPU and GPU using a single pointer. The CUDA runtime system automatically migrates data allocated in unified memory between host and device at the page level, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU. Further details are in the CUDA Programming Guide.
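A small sketch in plain CUDA C: a single managed pointer is written by the CPU, processed by a kernel, and read back by the CPU without explicit copies (illustrative names; error checking omitted):

```cuda
__global__ void addOne(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main()
{
    int n = 1024;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));  // one pointer, valid on host and device

    for (int i = 0; i < n; ++i) a[i] = (float)i;   // CPU writes; pages migrate on demand
    addOne<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();                        // wait before the CPU touches the data

    cudaFree(a);
    return 0;
}
```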
The CUDA programming model manages concurrency by executing asynchronous commands in streams. A stream represents a sequence of commands that execute in order. Different streams may execute their commands concurrently or out-of-order with respect to each other. This is useful for heterogeneous computing, where applications want to execute functions concurrently using all processors of the system, including CPUs and GPUs.
GPU devices of compute capability 2 and higher can:
- Execute multiple kernels concurrently
- Perform an asynchronous memory copy to or from the GPU concurrently with kernel execution
- Overlap memory copy operations to and from the device
- Perform an intra-device copy simultaneously with kernel execution
These concurrent tasks are supported through asynchronous functions that return control to the host thread before the device completes the requested task. In this way, many device operations can be queued up and executed when the appropriate device resources become available.
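A common pattern, sketched in plain CUDA C: split the work in two halves and pipeline copy-in, kernel, and copy-out across two streams. It assumes a pinned host buffer `h` and a device buffer `d` of `2 * half` floats and a hypothetical kernel `process`:

```cuda
cudaStream_t s[2];
for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

for (int k = 0; k < 2; ++k) {
    size_t off   = (size_t)k * half;
    size_t bytes = half * sizeof(float);
    // Commands in different streams may overlap on capable devices; pinned
    // host memory is required for the copies to be truly asynchronous.
    cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, s[k]);
    process<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half);
    cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, s[k]);
}
for (int k = 0; k < 2; ++k) {
    cudaStreamSynchronize(s[k]);   // wait for each stream to drain
    cudaStreamDestroy(s[k]);
}
```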