RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://docs.nvidia.com/cuda/cuda-runtime-api/groupCUDARTEXECUTION.html below:

CUDA Runtime API :: CUDA Toolkit Documentation

__host__ â __device__ âcudaError_t cudaFuncGetAttributes ( cudaFuncAttributes*Â attr, const void*Â func )

Find out attributes for a given function.

attr: - Return pointer to function's attributes
func: - Device function symbol

This function obtains the attributes of a function specified via func. func is a device function symbol and must be declared as a __global__ function. The fetched attributes are placed in attr. If the specified function does not exist, then it is assumed to be a cudaKernel_t and used as is. For templated functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>

Note that some function attributes such as maxThreadsPerBlock may vary based on the device that is currently being used.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

__host__ âcudaError_t cudaFuncGetName ( const char**Â name, const void*Â func )

Returns the function name for a device entry function pointer.

name: - The returned name of the function
func: - The function pointer to retrieve name for

Returns in **name the function name associated with the symbol func . The function name is returned as a null-terminated string. This API may return a mangled name if the function is not declared as having C linkage. If **name is NULL, cudaErrorInvalidValue is returned. If func is not a device entry function, then it is assumed to be a cudaKernel_t and used as is.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

cudaFuncGetName (C++ API)

__host__ âcudaError_t cudaFuncGetParamInfo ( const void*Â func, size_tÂ paramIndex, size_t*Â paramOffset, size_t*Â paramSize )

Returns the offset and size of a kernel parameter in the device-side parameter layout.

func: - The function to query
paramIndex: - The parameter index to query
paramOffset: - The offset into the device-side parameter layout at which the parameter resides
paramSize: - The size of the parameter in the device-side parameter layout

Queries the kernel parameter at paramIndex in func's list of parameters and returns parameter information via paramOffset and paramSize. paramOffset returns the offset of the parameter in the device-side parameter layout. paramSize returns the size in bytes of the parameter. This information can be used to update kernel node parameters from the device via cudaGraphKernelNodeSetParam() and cudaGraphKernelNodeUpdatesApply(). paramIndex must be less than the number of parameters that func takes.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

__host__ âcudaError_t cudaFuncSetAttribute ( const void*Â func, cudaFuncAttributeÂ attr, int Â value )

Set attributes for a given function.

func: - Function to get attributes of
attr: - Attribute to set
value: - Value to set

This function sets the attributes of a function specified via func. The parameter func must be a pointer to a function that executes on the device. The parameter specified by func must be declared as a __global__ function. The enumeration defined by attr is set to the value defined by value. If the specified function does not exist, then it is assumed to be a cudaKernel_t and used as is. If the specified attribute cannot be written, or if the value is incorrect, then cudaErrorInvalidValue is returned.

Valid values for attr are:

cudaFuncAttributeMaxDynamicSharedMemorySize - The requested maximum size in bytes of dynamically-allocated shared memory. The sum of this value and the function attribute sharedSizeBytes cannot exceed the device attribute cudaDevAttrMaxSharedMemoryPerBlockOptin. The maximal size of requestable dynamic shared memory may differ by GPU architecture.
cudaFuncAttributePreferredSharedMemoryCarveout - On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory. See cudaDevAttrMaxSharedMemoryPerMultiprocessor. This is only a hint, and the driver can choose a different ratio if required to execute the function.
cudaFuncAttributeRequiredClusterWidth: The required cluster width in blocks. The width, height, and depth values must either all be 0 or all be positive. The validity of the cluster dimensions is checked at launch time. If the value is set during compile time, it cannot be set at runtime. Setting it at runtime will return cudaErrorNotPermitted.
cudaFuncAttributeRequiredClusterHeight: The required cluster height in blocks. The width, height, and depth values must either all be 0 or all be positive. The validity of the cluster dimensions is checked at launch time. If the value is set during compile time, it cannot be set at runtime. Setting it at runtime will return cudaErrorNotPermitted.
cudaFuncAttributeRequiredClusterDepth: The required cluster depth in blocks. The width, height, and depth values must either all be 0 or all be positive. The validity of the cluster dimensions is checked at launch time. If the value is set during compile time, it cannot be set at runtime. Setting it at runtime will return cudaErrorNotPermitted.
cudaFuncAttributeNonPortableClusterSizeAllowed: Indicates whether the function can be launched with non-portable cluster size. 1 is allowed, 0 is disallowed.
cudaFuncAttributeClusterSchedulingPolicyPreference: The block scheduling policy of a function. The value type is cudaClusterSchedulingPolicy.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

cudaLaunchKernel (C++ API), cudaFuncSetCacheConfig ( C++ API), cudaFuncGetAttributes ( C API),

__host__ âcudaError_t cudaFuncSetCacheConfig ( const void*Â func, cudaFuncCacheÂ cacheConfig )

Sets the preferred cache configuration for a device function.

func: - Device function symbol
cacheConfig: - Requested cache configuration

On devices where the L1 cache and shared memory use the same hardware resources, this sets through cacheConfig the preferred cache configuration for the function specified via func. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute func.

func is a device function symbol and must be declared as a __global__ function. If the specified function does not exist, then cudaErrorInvalidDeviceFunction is returned. For templated functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>

This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.

Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point.

The supported cache configurations are:

cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory

__device__ â void* cudaGetParameterBuffer ( size_tÂ alignment, size_tÂ size )

Obtains a parameter buffer.

alignment: - Specifies alignment requirement of the parameter buffer
size: - Specifies size requirement in bytes

Returns pointer to the allocated parameterBuffer

Obtains a parameter buffer which can be filled with parameters for a kernel launch. Parameters passed to cudaLaunchDevice must be allocated via this function.

This is a low level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code should use <<< >>> to launch kernels.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also:

cudaLaunchDevice

__device__ â void cudaGridDependencySynchronize ( void ) [inline]

Programmatic grid dependency synchronization.

__host__ âcudaError_t cudaLaunchCooperativeKernel ( const void*Â func, dim3Â gridDim, dim3Â blockDim, void**Â args, size_tÂ sharedMem, cudaStream_tÂ stream )

Launches a device function where thread blocks can cooperate and synchronize as they execute.

func: - Device function symbol
gridDim: - Grid dimentions
blockDim: - Block dimentions
args: - Arguments
sharedMem: - Shared memory
stream: - Stream identifier

The function invokes kernel func on gridDim (gridDim.xgridDim.ygridDim.z) grid of blocks. Each block contains blockDim (blockDim.xblockDim.yblockDim.z) threads.

The device on which this kernel is invoked must have a non-zero value for the device attribute cudaDevAttrCooperativeLaunch.

The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute cudaDevAttrMultiProcessorCount.

The kernel cannot make use of CUDA dynamic parallelism.

If the kernel has N parameters the args should point to array of N pointers. Each pointer, from args[0] to args[N - 1], point to the region of memory from which the actual parameter will be copied.

For templated functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>

sharedMem sets the amount of dynamic shared memory that will be available to each thread block.

stream specifies a stream the invocation is associated to.

Note:

This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

See also:

cudaLaunchCooperativeKernel (C++ API), cuLaunchCooperativeKernel

__device__ âcudaError_t cudaLaunchDevice ( void*Â func, void*Â parameterBuffer, dim3Â gridDimension, dim3Â blockDimension, unsigned int Â sharedMemSize, cudaStream_tÂ stream )

Launches a specified kernel.

func: - Pointer to the kernel to be launched
parameterBuffer: - Holds the parameters to the launched kernel. parameterBuffer can be NULL. (Optional)
gridDimension: - Specifies grid dimensions
blockDimension: - Specifies block dimensions
sharedMemSize: - Specifies size of shared memory
stream: - Specifies the stream to be used

Launches a specified kernel with the specified parameter buffer. A parameter buffer can be obtained by calling cudaGetParameterBuffer().

This is a low level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code should use <<< >>> to launch the kernels.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

Please refer to Execution Configuration and Parameter Buffer Layout from the CUDA Programming Guide for the detailed descriptions of launch configuration and parameter layout respectively.

See also:

cudaGetParameterBuffer

__host__ âcudaError_t cudaLaunchHostFunc ( cudaStream_tÂ stream, cudaHostFn_tÂ fn, void*Â userData )

Enqueues a host function call in a stream.

stream
fn: - The function to call once preceding stream operations are complete
userData: - User-specified data to be passed to the function

Enqueues a host function to run in a stream. The function will be called after currently enqueued work and will block work added after it.

The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in cudaErrorNotPermitted, but this is not required. The host function must not perform any synchronization that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.

For the purposes of Unified Memory, execution makes a number of guarantees:

The stream is considered idle for the duration of the function's execution. Thus, for example, the function may always use memory attached to the stream it was enqueued in.
The start of execution of the function has the same effect as synchronizing an event recorded in the same stream immediately prior to the function. It thus synchronizes streams which have been "joined" prior to the function.
Adding device work to any stream does not have the effect of making the stream active until all preceding host functions and stream callbacks have executed. Thus, for example, a function might use global attached memory even if work has been added to another stream, if the work has been ordered behind the function call with an event.
Completion of the function does not cause a stream to become active except as described above. The stream will remain idle if no device work follows the function, and will remain idle across consecutive host functions or stream callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a host function at the end of the stream.

Note that, in constrast to cuStreamAddCallback, the function will not be called in the event of an error in the CUDA context.

__host__ âcudaError_t cudaLaunchKernel ( const void*Â func, dim3Â gridDim, dim3Â blockDim, void**Â args, size_tÂ sharedMem, cudaStream_tÂ stream )

Launches a device function.

func: - Device function symbol
gridDim: - Grid dimentions
blockDim: - Block dimentions
args: - Arguments
sharedMem: - Shared memory
stream: - Stream identifier

cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure, cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorSharedObjectInitFailed, cudaErrorInvalidPtx, cudaErrorUnsupportedPtxVersion, cudaErrorNoKernelImageForDevice, cudaErrorJitCompilerNotFound, cudaErrorJitCompilationDisabled

The function invokes kernel func on gridDim (gridDim.xgridDim.ygridDim.z) grid of blocks. Each block contains blockDim (blockDim.xblockDim.yblockDim.z) threads.

If the kernel has N parameters the args should point to array of N pointers. Each pointer, from args[0] to args[N - 1], point to the region of memory from which the actual parameter will be copied.

For templated functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>

sharedMem sets the amount of dynamic shared memory that will be available to each thread block.

stream specifies a stream the invocation is associated to.

Note:

This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

See also:

cudaLaunchKernel (C++ API), cuLaunchKernel

__host__ âcudaError_t cudaLaunchKernelExC ( const cudaLaunchConfig_t*Â config, const void*Â func, void**Â args )

Launches a CUDA function with launch-time configuration.

config: - Launch configuration
func: - Kernel to launch
args: - Array of pointers to kernel parameters

Note that the functionally equivalent variadic template cudaLaunchKernelEx is available for C++11 and newer.

Invokes the kernel func on config->gridDim (config->gridDim.xconfig->gridDim.yconfig->gridDim.z) grid of blocks. Each block contains config->blockDim (config->blockDim.xconfig->blockDim.yconfig->blockDim.z) threads.

config->dynamicSmemBytes sets the amount of dynamic shared memory that will be available to each thread block.

config->stream specifies a stream the invocation is associated to.

Configuration beyond grid and block dimensions, dynamic shared memory size, and stream can be provided with the following two fields of config:

config->attrs is an array of config->numAttrs contiguous cudaLaunchAttribute elements. The value of this pointer is not considered if config->numAttrs is zero. However, in that case, it is recommended to set the pointer to NULL. config->numAttrs is the number of attributes populating the first config->numAttrs positions of the config->attrs array.

If the kernel has N parameters the args should point to array of N pointers. Each pointer, from args[0] to args[N - 1], point to the region of memory from which the actual parameter will be copied.

N.B. This function is so named to avoid unintentionally invoking the templated version, cudaLaunchKernelEx, for kernels taking a single void** or void* parameter.

Note:

This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
The API can also be used with a kernel cudaKernel_t by querying the handle using cudaLibraryGetKernel() or cudaGetKernel and then passing it to the API by casting to void*. The symbol entryFuncAddr passed to cudaGetKernel should be a symbol that is registered with the same CUDA Runtime instance.
Passing a symbol that belongs that belongs to a different runtime instance will result in undefined behavior. The only type that can be reliably passed to a different runtime instance is cudaKernel_t

See also:

cudaLaunchKernelEx(const cudaLaunchConfig_t *config, void (*kernel)(ExpTypes...), ActTypes &&... args) "cudaLaunchKernelEx (C++ API)", cuLaunchKernelEx

__device__ â void cudaTriggerProgrammaticLaunchCompletion ( void ) [inline]

Programmatic dependency trigger.

This device function ensures the programmatic launch completion edges / events are fulfilled. See cudaLaunchAttributeID::cudaLaunchAttributeProgrammaticStreamSerialization and cudaLaunchAttributeID::cudaLaunchAttributeProgrammaticEvent for more information. The event / edge kick off only happens when every CTAs in the grid has either exited or called this function at least once, otherwise the kick off happens automatically after all warps finishes execution but before the grid completes. The kick off only enables scheduling of the secondary kernel. It provides no memory visibility guarantee itself. The user could enforce memory visibility by inserting a memory fence of the correct scope.

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html below:

CUDA Runtime API :: CUDA Toolkit Documentation

Showing content from https://docs.nvidia.com/cuda/cuda-runtime-api/groupCUDARTEXECUTION.html below: