thanks to nvidia ❤

8ae5fc5 over 2 years ago

4.54 kB

metadata

grand_parent: Extended API
parent: Synchronization Primitives
nav_order: 2

`cuda::barrier`

Defined in header <cuda/barrier>:

template <cuda::thread_scope Scope,
          typename CompletionFunction = /* unspecified */>
class cuda::barrier;

The class template cuda::barrier is an extended form of cuda::std::barrier that takes an additional cuda::thread_scope argument. It has the same interface and semantics as cuda::std::barrier, with the following additional operations.

Barrier Operations

| cuda::barrier::init | Initialize a cuda::barrier. (friend function) | | cuda::device::barrier_native_handle | Get the native handle to a cuda::barrier. (function template) |

NVCC `shared` Initialization Warnings

When using libcu++ with NVCC, a __shared__ cuda::barrier will lead to the following warning because __shared__ variables are not initialized:

warning: dynamic initialization is not supported for a function-scope static
__shared__ variable within a __device__/__global__ function

It can be silenced using #pragma diag_suppress static_var_with_dynamic_init.

To properly initialize a __shared__ cuda::barrier, use the cuda::barrier::init friend function.

Concurrency Restrictions

An object of type cuda::barrier or cuda::std::barrier shall not be accessed concurrently by CPU and GPU threads unless:

it is in unified memory and the concurrentManagedAccess property is 1, or
it is in CPU memory and the hostNativeAtomicSupported property is 1.

Note, for objects of scopes other than cuda::thread_scope_system this is a data-race, and thefore also prohibited regardless of memory characteristics.

Under CUDA Compute Capability 8 (Ampere) or above, when an object of type cuda::barrier<thread_scope_block> is placed in __shared__ memory, the member function arrive performs a reduction of the arrival count among coalesced threads followed by the arrival operation in one thread. Programs shall ensure that this transformation would not introduce errors, for example relative to the requirements of thread.barrier.class paragraph 12 of ISO/IEC IS 14882 (the C++ Standard).

Under CUDA Compute Capability 6 (Pascal) or prior, an object of type cuda::barrier or cuda::std::barrier may not be used.

Implementation-Defined Behavior

For each cuda::thread_scope S and CompletionFunction F, the value of cuda::barrier<S, F>::max() is as follows:

`cuda::thread_scope` `S`	`CompletionFunction` `F`	`barrier<S, F>::max()`
`cuda::thread_scope_block`	Default or user-provided	`(1 << 20) - 1`
Not `cuda::thread_scope_block`	Default	`cuda::std::numeric_limits<cuda::std::int32_t>::max()`
Not `cuda::thread_scope_block`	User-provided	`cuda::std::numeric_limits<cuda::std::ptrdiff_t>::max()`

Example

#include <cuda/barrier>

__global__ void example_kernel() {
  // This barrier is suitable for all threads in the system.
  cuda::barrier<cuda::thread_scope_system> a(10);

  // This barrier has the same type as the previous one (`a`).
  cuda::std::barrier<> b(10);

  // This barrier is suitable for all threads on the current processor (e.g. GPU).
  cuda::barrier<cuda::thread_scope_device> c(10);

  // This barrier is suitable for all threads in the same thread block.
  cuda::barrier<cuda::thread_scope_block> d(10);
}

See it on Godbolt{: .btn }

cuda::barrier