CUDA Runtime API :: CUDA Toolkit Documentation

__host__ ​cudaError_t cudaArrayGetInfo ( cudaChannelFormatDesc* desc, cudaExtent* extent, unsigned int* flags, cudaArray_t array )

Gets info about the specified cudaArray.

desc
- Returned array type
extent
- Returned array shape. 2D arrays will have depth of zero
flags
- Returned array flags
array
- The cudaArray to get info for
__host__ ​cudaError_t cudaArrayGetMemoryRequirements ( cudaArrayMemoryRequirements* memoryRequirements, cudaArray_t array, int  device )

Returns the memory requirements of a CUDA array.

memoryRequirements
- Pointer to cudaArrayMemoryRequirements
array
- CUDA array to get the memory requirements of
device
- Device to get the memory requirements for
__host__ ​cudaError_t cudaArrayGetPlane ( cudaArray_t* pPlaneArray, cudaArray_t hArray, unsigned int  planeIdx )

Gets a CUDA array plane from a CUDA array.

pPlaneArray
- Returned CUDA array referenced by the planeIdx
hArray
- CUDA array
planeIdx
- Plane index

Returns in pPlaneArray a CUDA array that represents a single format plane of the CUDA array hArray.

If planeIdx is greater than the maximum number of planes in this array, or if the array does not have a multi-planar format (e.g., cudaChannelFormatKindNV12), then cudaErrorInvalidValue is returned.

Note that if the hArray has format cudaChannelFormatKindNV12, then passing in 0 for planeIdx returns a CUDA array of the same size as hArray but with one 8-bit channel and cudaChannelFormatKindUnsigned as its format kind. If 1 is passed for planeIdx, then the returned CUDA array has half the height and width of hArray with two 8-bit channels and cudaChannelFormatKindUnsigned as its format kind.
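The pattern above can be sketched as follows; the array handle nv12Array is an assumption for illustration and is presumed to have been created elsewhere with format cudaChannelFormatKindNV12:

    #include <cuda_runtime.h>

    // Hedged sketch: split an existing NV12 CUDA array into its two format planes.
    cudaError_t getNV12Planes(cudaArray_t nv12Array,
                              cudaArray_t* luma, cudaArray_t* chroma)
    {
        // Plane 0: full-size array, one 8-bit unsigned channel.
        cudaError_t err = cudaArrayGetPlane(luma, nv12Array, 0);
        if (err != cudaSuccess) return err;
        // Plane 1: half-width/half-height array, two 8-bit unsigned channels.
        return cudaArrayGetPlane(chroma, nv12Array, 1);
    }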

Note:

This function may also return error codes from previous, asynchronous launches.

See also:

cuArrayGetPlane

__host__ ​cudaError_t cudaArrayGetSparseProperties ( cudaArraySparseProperties* sparseProperties, cudaArray_t array )

Returns the layout properties of a sparse CUDA array.

__host__ ​ __device__ ​cudaError_t cudaFree ( void* devPtr )

Frees memory on the device.

devPtr
- Device pointer to memory to free

Frees the memory space pointed to by devPtr, which must have been returned by a previous call to one of the following memory allocation APIs - cudaMalloc(), cudaMallocPitch(), cudaMallocManaged(), cudaMallocAsync(), cudaMallocFromPoolAsync().

Note - This API will not perform any implicit synchronization when the pointer was allocated with cudaMallocAsync or cudaMallocFromPoolAsync. Callers must ensure that all accesses to these pointer have completed before invoking cudaFree. For best performance and memory reuse, users should use cudaFreeAsync to free memory allocated via the stream ordered memory allocator. For all other pointers, this API may perform implicit synchronization.

If cudaFree(devPtr) has already been called before, an error is returned. If devPtr is 0, no operation is performed. cudaFree() returns cudaErrorInvalidValue in case of failure.

The device version of cudaFree cannot be used with a *devPtr allocated using the host API, and vice versa.
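As a minimal sketch of the note above, memory obtained from the stream-ordered allocator is best released with cudaFreeAsync; cudaFree also accepts such pointers but performs no implicit synchronization for them. The helper name and sizes below are illustrative only:

    #include <cuda_runtime.h>

    // Hedged sketch: stream-ordered allocation and release.
    void streamOrderedExample(cudaStream_t stream)
    {
        void* p = nullptr;
        cudaMallocAsync(&p, 1 << 20, stream);   // stream-ordered allocation
        // ... enqueue work on 'stream' that uses p ...
        cudaFreeAsync(p, stream);               // preferred: frees in stream order
    }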

See also:

cudaMalloc, cudaMallocPitch, cudaMallocManaged, cudaMallocArray, cudaFreeArray, cudaMallocAsync, cudaMallocFromPoolAsync, cudaMallocHost ( C API), cudaFreeHost, cudaMalloc3D, cudaMalloc3DArray, cudaFreeAsync, cudaHostAlloc, cuMemFree

__host__ ​cudaError_t cudaFreeArray ( cudaArray_t array )

Frees an array on the device.

array
- Pointer to array to free
__host__ ​cudaError_t cudaFreeHost ( void* ptr )

Frees page-locked memory.

ptr
- Pointer to memory to free

Frees the memory space pointed to by ptr, which must have been returned by a previous call to cudaMallocHost() or cudaHostAlloc().

See also:

cudaMalloc, cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMallocHost ( C API), cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc, cuMemFreeHost

__host__ ​cudaError_t cudaFreeMipmappedArray ( cudaMipmappedArray_t mipmappedArray )

Frees a mipmapped array on the device.

mipmappedArray
- Pointer to mipmapped array to free
__host__ ​cudaError_t cudaGetMipmappedArrayLevel ( cudaArray_t* levelArray, cudaMipmappedArray_const_t mipmappedArray, unsigned int  level )

Gets a mipmap level of a CUDA mipmapped array.

levelArray
- Returned mipmap level CUDA array
mipmappedArray
- CUDA mipmapped array
level
- Mipmap level

Returns in *levelArray a CUDA array that represents a single mipmap level of the CUDA mipmapped array mipmappedArray.

If level is greater than the maximum number of levels in this mipmapped array, cudaErrorInvalidValue is returned.

If mipmappedArray is NULL, cudaErrorInvalidResourceHandle is returned.
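A brief sketch of retrieving the base level; the handle mipArray is assumed to have been allocated elsewhere with cudaMallocMipmappedArray:

    #include <cuda_runtime.h>

    // Hedged sketch: fetch mipmap level 0 of an existing mipmapped array.
    cudaError_t getBaseLevel(cudaMipmappedArray_t mipArray, cudaArray_t* level0)
    {
        return cudaGetMipmappedArrayLevel(level0, mipArray, 0);
    }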

See also:

cudaMalloc3D, cudaMalloc, cudaMallocPitch, cudaFree, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, make_cudaExtent, cuMipmappedArrayGetLevel

__host__ ​cudaError_t cudaGetSymbolAddress ( void** devPtr, const void* symbol )

Finds the address associated with a CUDA symbol.

devPtr
- Return device pointer associated with symbol
symbol
- Device symbol address
__host__ ​cudaError_t cudaGetSymbolSize ( size_t* size, const void* symbol )

Finds the size of the object associated with a CUDA symbol.

size
- Size of object associated with symbol
symbol
- Device symbol address
__host__ ​cudaError_t cudaHostAlloc ( void** pHost, size_t size, unsigned int  flags )

Allocates page-locked memory on the host.

pHost
- Host pointer to allocated memory
size
- Requested allocation size in bytes
flags
- Requested properties of allocated memory

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

The flags parameter enables different options to be specified that affect the allocation, as follows.

All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.

In order for the cudaHostAllocMapped flag to have any effect, the CUDA context must support the cudaDeviceMapHost flag, which can be checked via cudaGetDeviceFlags(). The cudaDeviceMapHost flag is implicitly set for contexts created via the runtime API.

The cudaHostAllocMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostAllocPortable flag.

Memory allocated by this function must be freed with cudaFreeHost().
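A minimal sketch of the allocation described above, assuming the device supports mapped pinned memory; the buffer size and flag combination are illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: allocate mapped, portable pinned host memory and free it.
    int main()
    {
        float* hostBuf = nullptr;
        unsigned int flags = cudaHostAllocMapped | cudaHostAllocPortable;
        if (cudaHostAlloc((void**)&hostBuf, 1024 * sizeof(float), flags) != cudaSuccess)
            return 1;
        // ... fill hostBuf; transfers and mapped device access are accelerated ...
        cudaFreeHost(hostBuf);
        return 0;
    }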

See also:

cudaSetDeviceFlags, cudaMallocHost ( C API), cudaFreeHost, cudaGetDeviceFlags, cuMemHostAlloc

__host__ ​cudaError_t cudaHostGetDevicePointer ( void** pDevice, void* pHost, unsigned int  flags )

Passes back device pointer of mapped host memory allocated by cudaHostAlloc or registered by cudaHostRegister.

pDevice
- Returned device pointer for mapped memory
pHost
- Requested host pointer mapping
flags
- Flags for extensions (must be 0 for now)

Passes back the device pointer corresponding to the mapped, pinned host buffer allocated by cudaHostAlloc() or registered by cudaHostRegister().

cudaHostGetDevicePointer() will fail if the cudaDeviceMapHost flag was not specified before deferred context creation occurred, or if called on a device that does not support mapped, pinned memory.

For devices that have a non-zero value for the device attribute cudaDevAttrCanUseHostPointerForRegisteredMem, the memory can also be accessed from the device using the host pointer pHost. The device pointer returned by cudaHostGetDevicePointer() may or may not match the original host pointer pHost and depends on the devices visible to the application. If all devices visible to the application have a non-zero value for the device attribute, the device pointer returned by cudaHostGetDevicePointer() will match the original pointer pHost. If any device visible to the application has a zero value for the device attribute, the device pointer returned by cudaHostGetDevicePointer() will not match the original host pointer pHost, but it will be suitable for use on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the memory using either pointer on devices that have a non-zero value for the device attribute. Note however that such devices should access the memory using only one of the two pointers and not both.

flags is provided for future releases. For now, it must be set to 0.
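A brief sketch of the call; the wrapper name is illustrative and pHost is assumed to be mapped pinned memory from cudaHostAlloc or cudaHostRegister:

    #include <cuda_runtime.h>

    // Hedged sketch: obtain the device-side alias of a mapped pinned allocation.
    cudaError_t mapPinned(void* pHost, void** pDevice)
    {
        // flags must currently be 0; fails if the device cannot map pinned memory.
        return cudaHostGetDevicePointer(pDevice, pHost, 0);
    }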

See also:

cudaSetDeviceFlags, cudaHostAlloc, cuMemHostGetDevicePointer

__host__ ​cudaError_t cudaHostGetFlags ( unsigned int* pFlags, void* pHost )

Passes back flags used to allocate pinned host memory allocated by cudaHostAlloc.

pFlags
- Returned flags word
pHost
- Host pointer
__host__ ​cudaError_t cudaHostRegister ( void* ptr, size_t size, unsigned int  flags )

Registers an existing host memory range for use by CUDA.

ptr
- Host pointer to memory to page-lock
size
- Size in bytes of the address range to page-lock
flags
- Flags for allocation request

Page-locks the memory range specified by ptr and size and maps it for the device(s) as specified by flags. This memory range also is added to the same tracking mechanism as cudaHostAlloc() to automatically accelerate calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory that has not been registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.

On systems where pageableMemoryAccessUsesHostPageTables is true, cudaHostRegister will not page-lock the memory range specified by ptr but only populate unpopulated pages.

cudaHostRegister is supported only on I/O coherent devices that have a non-zero value for the device attribute cudaDevAttrHostRegisterSupported.

The flags parameter enables different options to be specified that affect the allocation, as follows.

All of these flags are orthogonal to one another: a developer may page-lock memory that is portable or mapped with no restrictions.

The CUDA context must have been created with the cudaDeviceMapHost flag in order for the cudaHostRegisterMapped flag to have any effect.

The cudaHostRegisterMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostRegisterPortable flag.

For devices that have a non-zero value for the device attribute cudaDevAttrCanUseHostPointerForRegisteredMem, the memory can also be accessed from the device using the host pointer ptr. The device pointer returned by cudaHostGetDevicePointer() may or may not match the original host pointer ptr and depends on the devices visible to the application. If all devices visible to the application have a non-zero value for the device attribute, the device pointer returned by cudaHostGetDevicePointer() will match the original pointer ptr. If any device visible to the application has a zero value for the device attribute, the device pointer returned by cudaHostGetDevicePointer() will not match the original host pointer ptr, but it will be suitable for use on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the memory using either pointer on devices that have a non-zero value for the device attribute. Note however that such devices should access the memory using only one of the two pointers and not both.

The memory page-locked by this function must be unregistered with cudaHostUnregister().
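As a sketch of the register/unregister pairing described above, assuming the device reports cudaDevAttrHostRegisterSupported; the buffer size is illustrative:

    #include <cuda_runtime.h>
    #include <cstdlib>

    // Hedged sketch: page-lock an existing malloc'd buffer for faster transfers,
    // then unregister and free it.
    int main()
    {
        size_t bytes = 1 << 20;
        void* buf = std::malloc(bytes);
        if (cudaHostRegister(buf, bytes, cudaHostRegisterDefault) == cudaSuccess) {
            // ... cudaMemcpy to/from buf is now accelerated ...
            cudaHostUnregister(buf);
        }
        std::free(buf);
        return 0;
    }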

See also:

cudaHostUnregister, cudaHostGetFlags, cudaHostGetDevicePointer, cuMemHostRegister

__host__ ​cudaError_t cudaHostUnregister ( void* ptr )

Unregisters a memory range that was registered with cudaHostRegister.

ptr
- Host pointer to memory to unregister
__host__ ​ __device__ ​cudaError_t cudaMalloc ( void** devPtr, size_t size )

Allocate memory on the device.

devPtr
- Pointer to allocated device memory
size
- Requested allocation size in bytes

Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure.

The device version of cudaFree cannot be used with a *devPtr allocated using the host API, and vice versa.
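A minimal sketch of the typical allocate/check/free pattern; the allocation size is illustrative:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hedged sketch: allocate 256 floats of linear device memory and release them.
    int main()
    {
        float* devPtr = nullptr;
        cudaError_t err = cudaMalloc(&devPtr, 256 * sizeof(float));
        if (err != cudaSuccess) {
            std::printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // ... use devPtr in kernels ...
        cudaFree(devPtr);
        return 0;
    }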

See also:

cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMalloc3D, cudaMalloc3DArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, cuMemAlloc

__host__ ​cudaError_t cudaMalloc3D ( cudaPitchedPtr* pitchedDevPtr, cudaExtent extent )

Allocates logical 1D, 2D, or 3D memory objects on the device.

pitchedDevPtr
- Pointer to allocated pitched device memory
extent
- Requested allocation size (width field in bytes)

Allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met. The pitch returned in the pitch field of pitchedDevPtr is the width in bytes of the allocation.

The returned cudaPitchedPtr contains additional fields xsize and ysize, the logical width and height of the allocation, which are equivalent to the width and height extent parameters provided by the programmer during allocation.

For allocations of 2D and 3D objects, it is highly recommended that programmers perform allocations using cudaMalloc3D() or cudaMallocPitch(). Due to alignment restrictions in the hardware, this is especially true if the application will be performing memory copies involving 2D or 3D objects (whether linear memory or CUDA arrays).
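A minimal sketch of a 3D allocation as described above; the volume dimensions are illustrative. Note the width of the extent is given in bytes for cudaMalloc3D:

    #include <cuda_runtime.h>

    // Hedged sketch: allocate a 64 x 64 x 32 volume of floats as pitched linear memory.
    int main()
    {
        cudaExtent extent = make_cudaExtent(64 * sizeof(float), 64, 32); // width in bytes
        cudaPitchedPtr vol;
        if (cudaMalloc3D(&vol, extent) != cudaSuccess)
            return 1;
        // vol.pitch is the padded row width in bytes; vol.ptr is the base address.
        cudaFree(vol.ptr);
        return 0;
    }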

See also:

cudaMallocPitch, cudaFree, cudaMemcpy3D, cudaMemset3D, cudaMalloc3DArray, cudaMallocArray, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, make_cudaPitchedPtr, make_cudaExtent, cuMemAllocPitch

__host__ ​cudaError_t cudaMalloc3DArray ( cudaArray_t* array, const cudaChannelFormatDesc* desc, cudaExtent extent, unsigned int  flags = 0 )

Allocate an array on the device.

array
- Pointer to allocated array in device memory
desc
- Requested channel format
extent
- Requested allocation size (width field in elements)
flags
- Flags for extensions

Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

The cudaChannelFormatDesc is defined as:

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };

where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

cudaMalloc3DArray() can allocate the following:

The flags parameter enables different options to be specified that affect the allocation, as follows.

The width, height and depth extents must meet certain size requirements as listed in the following table. All values are specified in elements.

Note that 2D CUDA arrays have different size requirements if the cudaArrayTextureGather flag is set. In that case, the valid range for (width, height, depth) is ((1,maxTexture2DGather[0]), (1,maxTexture2DGather[1]), 0).

Valid extents are given per CUDA array type as {(width range in elements), (height range), (depth range)}, first for the general case and then with cudaArraySurfaceLoadStore set:

1D: { (1,maxTexture1D), 0, 0 }; with cudaArraySurfaceLoadStore: { (1,maxSurface1D), 0, 0 }
2D: { (1,maxTexture2D[0]), (1,maxTexture2D[1]), 0 }; with cudaArraySurfaceLoadStore: { (1,maxSurface2D[0]), (1,maxSurface2D[1]), 0 }
3D: { (1,maxTexture3D[0]), (1,maxTexture3D[1]), (1,maxTexture3D[2]) } OR { (1,maxTexture3DAlt[0]), (1,maxTexture3DAlt[1]), (1,maxTexture3DAlt[2]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface3D[0]), (1,maxSurface3D[1]), (1,maxSurface3D[2]) }
1D Layered: { (1,maxTexture1DLayered[0]), 0, (1,maxTexture1DLayered[1]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface1DLayered[0]), 0, (1,maxSurface1DLayered[1]) }
2D Layered: { (1,maxTexture2DLayered[0]), (1,maxTexture2DLayered[1]), (1,maxTexture2DLayered[2]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface2DLayered[0]), (1,maxSurface2DLayered[1]), (1,maxSurface2DLayered[2]) }
Cubemap: { (1,maxTextureCubemap), (1,maxTextureCubemap), 6 }; with cudaArraySurfaceLoadStore: { (1,maxSurfaceCubemap), (1,maxSurfaceCubemap), 6 }
Cubemap Layered: { (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[1]) }; with cudaArraySurfaceLoadStore: { (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[1]) }
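A minimal sketch of a 3D array allocation; the element type, extents, and use of the templated channel-descriptor helper are illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: allocate a 128 x 128 x 64 3D CUDA array of single-channel floats.
    int main()
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaExtent extent = make_cudaExtent(128, 128, 64);   // extents in elements
        cudaArray_t array = nullptr;
        if (cudaMalloc3DArray(&array, &desc, extent, 0) != cudaSuccess)
            return 1;
        // ... bind to a texture/surface object, fill via cudaMemcpy3D ...
        cudaFreeArray(array);
        return 0;
    }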

See also:

cudaMalloc3D, cudaMalloc, cudaMallocPitch, cudaFree, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, make_cudaExtent, cuArray3DCreate

__host__ ​cudaError_t cudaMallocArray ( cudaArray_t* array, const cudaChannelFormatDesc* desc, size_t width, size_t height = 0, unsigned int  flags = 0 )

Allocate an array on the device.

array
- Pointer to allocated array in device memory
desc
- Requested channel format
width
- Requested array allocation width
height
- Requested array allocation height
flags
- Requested properties of allocated array

Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

The cudaChannelFormatDesc is defined as:

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };

where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

The flags parameter enables different options to be specified that affect the allocation, as follows.

width and height must meet certain size requirements. See cudaMalloc3DArray() for more details.
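A minimal sketch of a 2D array allocation; the texel type and dimensions are illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: allocate a 256 x 256 2D CUDA array of uchar4 texels.
    int main()
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar4>();
        cudaArray_t array = nullptr;
        if (cudaMallocArray(&array, &desc, 256, 256, cudaArrayDefault) != cudaSuccess)
            return 1;
        // ... copy data in with cudaMemcpy2DToArray, sample via a texture object ...
        cudaFreeArray(array);
        return 0;
    }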

See also:

cudaMalloc, cudaMallocPitch, cudaFree, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc, cuArrayCreate

__host__ ​cudaError_t cudaMallocHost ( void** ptr, size_t size )

Allocates page-locked memory on the host.

ptr
- Pointer to allocated host memory
size
- Requested allocation size in bytes

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy*(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().

On systems where pageableMemoryAccessUsesHostPageTables is true, cudaMallocHost may not page-lock the allocated memory.

Page-locking excessive amounts of memory with cudaMallocHost() may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
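A minimal sketch of the staging-buffer usage described above; the buffer size is illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: use a pinned staging buffer for a host-to-device transfer.
    int main()
    {
        const size_t bytes = 1 << 20;
        float *pinned = nullptr, *dev = nullptr;
        if (cudaMallocHost((void**)&pinned, bytes) != cudaSuccess) return 1;
        if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) { cudaFreeHost(pinned); return 1; }

        // ... fill 'pinned' on the host ...
        cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);  // accelerated by pinning

        cudaFree(dev);
        cudaFreeHost(pinned);
        return 0;
    }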

See also:

cudaMalloc, cudaMallocPitch, cudaMallocArray, cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc, cudaFree, cudaFreeArray, cudaMallocHost ( C++ API), cudaFreeHost, cudaHostAlloc, cuMemAllocHost

__host__ ​cudaError_t cudaMallocManaged ( void** devPtr, size_t size, unsigned int  flags = cudaMemAttachGlobal )

Allocates memory that will be automatically managed by the Unified Memory system.

Allocates size bytes of managed memory on the device and returns in *devPtr a pointer to the allocated memory. If the device doesn't support allocating managed memory, cudaErrorNotSupported is returned. Support for managed memory can be queried using the device attribute cudaDevAttrManagedMemory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. If size is 0, cudaMallocManaged returns cudaErrorInvalidValue. The pointer is valid on the CPU and on all GPUs in the system that support managed memory. All accesses to this pointer must obey the Unified Memory programming model.

flags specifies the default stream association for this allocation. flags must be one of cudaMemAttachGlobal or cudaMemAttachHost. The default value for flags is cudaMemAttachGlobal. If cudaMemAttachGlobal is specified, then this memory is accessible from any stream on any device. If cudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess; an explicit call to cudaStreamAttachMemAsync will be required to enable access on such devices.

If the association is later changed via cudaStreamAttachMemAsync to a single stream, the default association, as specified during cudaMallocManaged, is restored when that stream is destroyed. For __managed__ variables, the default association is always cudaMemAttachGlobal. Note that destroying a stream is an asynchronous operation, and as a result, the change to default association won't happen until all work in the stream has completed.

Memory allocated with cudaMallocManaged should be released with cudaFree.

Device memory oversubscription is possible for GPUs that have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess. Managed memory on such GPUs may be evicted from device memory to host memory at any time by the Unified Memory driver in order to make room for other allocations.

In a system where all GPUs have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess, managed memory may not be populated when this API returns and instead may be populated on access. In such systems, managed memory can migrate to any processor's memory at any time. The Unified Memory driver will employ heuristics to maintain data locality and prevent excessive page faults to the extent possible. The application can also guide the driver about memory usage patterns via cudaMemAdvise. The application can also explicitly migrate memory to a desired processor's memory via cudaMemPrefetchAsync.

In a multi-GPU system where all of the GPUs have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess and all the GPUs have peer-to-peer support with each other, the physical storage for managed memory is created on the GPU which is active at the time cudaMallocManaged is called. All other GPUs will reference the data at reduced bandwidth via peer mappings over the PCIe bus. The Unified Memory driver does not migrate memory among such GPUs.

In a multi-GPU system where not all GPUs have peer-to-peer support with each other and where the value of the device attribute cudaDevAttrConcurrentManagedAccess is zero for at least one of those GPUs, the location chosen for physical storage of managed memory is system-dependent.
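A minimal sketch of the managed-memory model described above; the kernel, element count, and scaling factor are illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: one pointer valid on both host and device.
    __global__ void scale(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 16;
        float* data = nullptr;
        if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return 1;

        for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly
        scale<<<(n + 255) / 256, 256>>>(data, n);     // device uses the same pointer
        cudaDeviceSynchronize();                      // required before the host reads again

        cudaFree(data);
        return 0;
    }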

See also:

cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMalloc3D, cudaMalloc3DArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, cudaDeviceGetAttribute, cudaStreamAttachMemAsync, cuMemAllocManaged

__host__ ​cudaError_t cudaMallocMipmappedArray ( cudaMipmappedArray_t* mipmappedArray, const cudaChannelFormatDesc* desc, cudaExtent extent, unsigned int  numLevels, unsigned int  flags = 0 )

Allocate a mipmapped array on the device.

mipmappedArray
- Pointer to allocated mipmapped array in device memory
desc
- Requested channel format
extent
- Requested allocation size (width field in elements)
numLevels
- Number of mipmap levels to allocate
flags
- Flags for extensions

Allocates a CUDA mipmapped array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA mipmapped array in *mipmappedArray. numLevels specifies the number of mipmap levels to be allocated. This value is clamped to the range [1, 1 + floor(log2(max(width, height, depth)))].

The cudaChannelFormatDesc is defined as:

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };

where cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

cudaMallocMipmappedArray() can allocate the following:

The flags parameter enables different options to be specified that affect the allocation, as follows.

The width, height and depth extents must meet certain size requirements as listed in the following table. All values are specified in elements.

Valid extents are given per CUDA array type as {(width range in elements), (height range), (depth range)}, first for the general case and then with cudaArraySurfaceLoadStore set:

1D: { (1,maxTexture1DMipmap), 0, 0 }; with cudaArraySurfaceLoadStore: { (1,maxSurface1D), 0, 0 }
2D: { (1,maxTexture2DMipmap[0]), (1,maxTexture2DMipmap[1]), 0 }; with cudaArraySurfaceLoadStore: { (1,maxSurface2D[0]), (1,maxSurface2D[1]), 0 }
3D: { (1,maxTexture3D[0]), (1,maxTexture3D[1]), (1,maxTexture3D[2]) } OR { (1,maxTexture3DAlt[0]), (1,maxTexture3DAlt[1]), (1,maxTexture3DAlt[2]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface3D[0]), (1,maxSurface3D[1]), (1,maxSurface3D[2]) }
1D Layered: { (1,maxTexture1DLayered[0]), 0, (1,maxTexture1DLayered[1]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface1DLayered[0]), 0, (1,maxSurface1DLayered[1]) }
2D Layered: { (1,maxTexture2DLayered[0]), (1,maxTexture2DLayered[1]), (1,maxTexture2DLayered[2]) }; with cudaArraySurfaceLoadStore: { (1,maxSurface2DLayered[0]), (1,maxSurface2DLayered[1]), (1,maxSurface2DLayered[2]) }
Cubemap: { (1,maxTextureCubemap), (1,maxTextureCubemap), 6 }; with cudaArraySurfaceLoadStore: { (1,maxSurfaceCubemap), (1,maxSurfaceCubemap), 6 }
Cubemap Layered: { (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[1]) }; with cudaArraySurfaceLoadStore: { (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[1]) }

See also:

cudaMalloc3D, cudaMalloc, cudaMallocPitch, cudaFree, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc, make_cudaExtent, cuMipmappedArrayCreate

__host__ ​cudaError_t cudaMallocPitch ( void** devPtr, size_t* pitch, size_t width, size_t height )

Allocates pitched memory on the device.

devPtr
- Pointer to allocated pitched device memory
pitch
- Pitch for allocation
width
- Requested pitched allocation width (in bytes)
height
- Requested pitched allocation height

Allocates at least width (in bytes) * height bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The function may pad the allocation to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing as the address is updated from row to row. The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as:

    T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

For allocations of 2D arrays, it is recommended that programmers consider performing pitch allocations using cudaMallocPitch(). Due to pitch alignment restrictions in the hardware, this is especially true if the application will be performing 2D memory copies between different regions of device memory (whether linear memory or CUDA arrays).
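A minimal sketch tying the allocation to the addressing formula above; the matrix dimensions, kernel, and element coordinates are illustrative:

    #include <cuda_runtime.h>

    // Hedged sketch: write one element of a pitched 512-column x 256-row float matrix.
    __global__ void setElement(char* base, size_t pitch, int row, int col, float v)
    {
        float* pElement = (float*)(base + row * pitch) + col;
        *pElement = v;
    }

    int main()
    {
        float* devPtr = nullptr;
        size_t pitch = 0;
        if (cudaMallocPitch((void**)&devPtr, &pitch, 512 * sizeof(float), 256) != cudaSuccess)
            return 1;
        setElement<<<1, 1>>>((char*)devPtr, pitch, 10, 20, 3.14f);
        cudaDeviceSynchronize();
        cudaFree(devPtr);
        return 0;
    }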

See also:

cudaMalloc, cudaFree, cudaMallocArray, cudaFreeArray, cudaMallocHost ( C API), cudaFreeHost, cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc, cuMemAllocPitch

__host__ ​cudaError_t cudaMemAdvise ( const void* devPtr, size_t count, cudaMemoryAdvise advice, cudaMemLocation location )

Advise about the usage of a given memory range.

devPtr
- Pointer to memory to set the advice for
count
- Size in bytes of the memory range
advice
- Advice to be applied for the specified memory range
location
- location to apply the advice for

Advise the Unified Memory subsystem about the usage pattern for the memory range starting at devPtr with a size of count bytes. The start address and end address of the memory range will be rounded down and rounded up respectively to be aligned to CPU page size before the advice is applied. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables. The memory range could also refer to system-allocated pageable memory provided it represents a valid, host-accessible region of memory and all additional constraints imposed by advice as outlined below are also satisfied. Specifying an invalid system-allocated pageable memory range results in an error being returned.

The advice parameter can take the following values:

See also:

cudaMemcpy, cudaMemcpyPeer, cudaMemcpyAsync, cudaMemcpy3DPeerAsync, cudaMemPrefetchAsync, cuMemAdvise

__host__ ​cudaError_t cudaMemDiscardAndPrefetchBatchAsync ( void** dptrs, size_t* sizes, size_t count, cudaMemLocation* prefetchLocs, size_t* prefetchLocIdxs, size_t numPrefetchLocs, unsigned long long flags, cudaStream_t stream )

Performs a batch of memory discards and prefetches asynchronously.

dptrs
- Array of pointers to be discarded
sizes
- Array of sizes for memory discard operations.
count
- Size of dptrs and sizes arrays.
prefetchLocs
- Array of locations to prefetch to.
prefetchLocIdxs
- Array of indices to specify which operands each entry in the prefetchLocs array applies to. The locations specified in prefetchLocs[k] will be applied to operations starting from prefetchLocIdxs[k] through prefetchLocIdxs[k+1] - 1. Also prefetchLocs[numPrefetchLocs - 1] will apply to copies starting from prefetchLocIdxs[numPrefetchLocs - 1] through count - 1.
numPrefetchLocs
- Size of prefetchLocs and prefetchLocIdxs arrays.
flags
- Flags reserved for future use. Must be zero.
stream
- Stream to enqueue the batch operation in

Performs a batch of memory discards followed by prefetches. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess otherwise the API will return an error.

Calling cudaMemDiscardAndPrefetchBatchAsync is semantically equivalent to calling cudaMemDiscardBatchAsync followed by cudaMemPrefetchBatchAsync, but is more optimal. For more details on what discarding and prefetching imply, please refer to cudaMemDiscardBatchAsync and cudaMemPrefetchBatchAsync respectively. Note that any reads, writes or prefetches to any part of the memory range that occur simultaneously with this combined discard+prefetch operation result in undefined behavior.

Performs memory discard and prefetch on the address ranges specified in dptrs and sizes. Both arrays must be of the same length, as specified by count. Each memory range specified must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables, or it may also refer to system-allocated memory when all devices have a non-zero value for cudaDevAttrPageableMemoryAccess. Every operation in the batch has to be associated with a valid location to prefetch the address range to, specified in the prefetchLocs array. Each entry in this array can apply to more than one operation. This is done by specifying, in the prefetchLocIdxs array, the index of the first operation that the corresponding entry in the prefetchLocs array applies to. Both prefetchLocs and prefetchLocIdxs must be of the same length, as specified by numPrefetchLocs. For example, if a batch has 10 operations listed in dptrs/sizes, the first 6 of which are to be prefetched to one location and the remaining 4 are to be prefetched to another, then numPrefetchLocs will be 2, prefetchLocIdxs will be {0, 6} and prefetchLocs will contain the two locations. Note that the first entry in prefetchLocIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less than count. Furthermore, numPrefetchLocs must be less than or equal to count.

__host__ ​cudaError_t cudaMemDiscardBatchAsync ( void** dptrs, size_t* sizes, size_t count, unsigned long long flags, cudaStream_t stream )

Performs a batch of memory discards asynchronously.

dptrs
- Array of pointers to be discarded
sizes
- Array of sizes for memory discard operations.
count
- Size of dptrs and sizes arrays.
flags
- Flags reserved for future use. Must be zero.
stream
- Stream to enqueue the batch operation in

Performs a batch of memory discards. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess otherwise the API will return an error.

Discarding a memory range informs the driver that the contents of that range are no longer useful. Discarding memory ranges allows the driver to optimize certain data migrations and can also help reduce memory pressure. This operation can be undone on any part of the range by either writing to it or prefetching it via cudaMemPrefetchAsync or cudaMemPrefetchBatchAsync. Reading from a discarded range, without a subsequent write or prefetch to that part of the range, will return an indeterminate value. Note that any reads, writes or prefetches to any part of the memory range that occur simultaneously with the discard operation result in undefined behavior.

Performs memory discard on address ranges specified in dptrs and sizes. Both arrays must be of the same length as specified by count. Each memory range specified must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables or it may also refer to system-allocated memory when all devices have a non-zero value for cudaDevAttrPageableMemoryAccess.

__host__ ​cudaError_t cudaMemGetInfo ( size_t* free, size_t* total )

Gets free and total device memory.

free
- Returned free memory in bytes
total
- Returned total memory in bytes

Returns in *total the total amount of memory available to the current context. Returns in *free the amount of memory on the device that is free according to the OS. CUDA is not guaranteed to be able to allocate all of the memory that the OS reports as free. In a multi-tenant situation, the free estimate returned is prone to a race condition: an allocation or free performed by a different process, or by a different thread in the same process, between the time the free memory is estimated and the time it is reported will cause the reported free value to deviate from the actual free memory.

The integrated GPU on Tegra shares memory with the CPU and other components of the SoC. The free and total values returned by the API exclude the SWAP memory space maintained by the OS on some platforms. The OS may move some of the memory pages into the swap area as the GPU or CPU allocates or accesses memory. See the Tegra app note on how to calculate total and free memory on Tegra.
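A minimal sketch of querying and printing the two values:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hedged sketch: report free and total device memory in MiB.
    int main()
    {
        size_t freeBytes = 0, totalBytes = 0;
        if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return 1;
        std::printf("free: %zu MiB, total: %zu MiB\n",
                    freeBytes >> 20, totalBytes >> 20);
        return 0;
    }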

See also:

cuMemGetInfo

__host__ ​cudaError_t cudaMemPrefetchAsync ( const void* devPtr, size_t count, cudaMemLocation location, unsigned int  flags, cudaStream_t stream = 0 )

Prefetches memory to the specified destination location.

devPtr
- Pointer to be prefetched
count
- Size in bytes
location
- location to prefetch to
flags
- flags for future use, must be zero now.
stream
- Stream to enqueue prefetch operation

Prefetches memory to the specified destination location. devPtr is the base device pointer of the memory to be prefetched and location specifies the destination location. count specifies the number of bytes to copy. stream is the stream in which the operation is enqueued. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables, or it may also refer to system-allocated memory on systems with non-zero cudaDevAttrPageableMemoryAccess.

Specifying cudaMemLocationTypeDevice for cudaMemLocation::type will prefetch memory to the GPU specified by device ordinal cudaMemLocation::id, which must have a non-zero value for the device attribute concurrentManagedAccess. Additionally, stream must be associated with a device that has a non-zero value for the device attribute concurrentManagedAccess. Specifying cudaMemLocationTypeHost as cudaMemLocation::type will prefetch data to host memory. Applications can request prefetching memory to a specific host NUMA node by specifying cudaMemLocationTypeHostNuma for cudaMemLocation::type and a valid host NUMA node id in cudaMemLocation::id. Users can also request prefetching memory to the host NUMA node closest to the current thread's CPU by specifying cudaMemLocationTypeHostNumaCurrent for cudaMemLocation::type. Note that when cudaMemLocation::type is either cudaMemLocationTypeHost or cudaMemLocationTypeHostNumaCurrent, cudaMemLocation::id will be ignored.

The start address and end address of the memory range will be rounded down and rounded up respectively to be aligned to CPU page size before the prefetch operation is enqueued in the stream.

If no physical memory has been allocated for this region, then this memory region will be populated and mapped on the destination device. If there's insufficient memory to prefetch the desired region, the Unified Memory driver may evict pages from other cudaMallocManaged allocations to host memory in order to make room. Device memory allocated using cudaMalloc or cudaMallocArray will not be evicted.

By default, any mappings to the previous location of the migrated pages are removed and mappings for the new location are only setup on the destination location. The exact behavior however also depends on the settings applied to this memory range via cuMemAdvise as described below:

If cudaMemAdviseSetReadMostly was set on any subset of this memory range, then that subset will create a read-only copy of the pages on destination location. If however the destination location is a host NUMA node, then any pages of that subset that are already in another host NUMA node will be transferred to the destination.

If cudaMemAdviseSetPreferredLocation was called on any subset of this memory range, then the pages will be migrated to location even if location is not the preferred location of any pages in the memory range.

If cudaMemAdviseSetAccessedBy was called on any subset of this memory range, then mappings to those pages from all the appropriate processors are updated to refer to the new location if establishing such a mapping is possible. Otherwise, those mappings are cleared.

Note that this API is not required for functionality and only serves to improve performance by allowing the application to migrate data to a suitable location before it is accessed. Memory accesses to this range are always coherent and are allowed even when the data is actively being migrated.

Note that this function is asynchronous with respect to the host and all work on other devices.
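A sketch against the signature above; the helper name, device ordinal, and buffer are illustrative, and managedPtr is assumed to be a cudaMallocManaged allocation:

    #include <cuda_runtime.h>

    // Hedged sketch: prefetch a managed range to device 0, then back to host memory.
    void prefetchRoundTrip(float* managedPtr, size_t bytes, cudaStream_t stream)
    {
        cudaMemLocation toDevice{};
        toDevice.type = cudaMemLocationTypeDevice;
        toDevice.id = 0;                                   // device ordinal
        cudaMemPrefetchAsync(managedPtr, bytes, toDevice, 0, stream);

        // ... enqueue kernels on 'stream' that read managedPtr ...

        cudaMemLocation toHost{};
        toHost.type = cudaMemLocationTypeHost;             // id is ignored for host
        cudaMemPrefetchAsync(managedPtr, bytes, toHost, 0, stream);
    }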

See also:

cudaMemcpy, cudaMemcpyPeer, cudaMemcpyAsync, cudaMemcpy3DPeerAsync, cudaMemAdvise, cuMemPrefetchAsync

__host__ ​cudaError_t cudaMemPrefetchBatchAsync ( void** dptrs, size_t* sizes, size_t count, cudaMemLocation* prefetchLocs, size_t* prefetchLocIdxs, size_t numPrefetchLocs, unsigned long long flags, cudaStream_t stream )

Performs a batch of memory prefetches asynchronously.

dptrs
- Array of pointers to be prefetched
sizes
- Array of sizes for memory prefetch operations.
count
- Size of dptrs and sizes arrays.
prefetchLocs
- Array of locations to prefetch to.
prefetchLocIdxs
- Array of indices to specify which operands each entry in the prefetchLocs array applies to. The locations specified in prefetchLocs[k] will be applied to copies starting from prefetchLocIdxs[k] through prefetchLocIdxs[k+1] - 1. Also prefetchLocs[numPrefetchLocs - 1] will apply to prefetches starting from prefetchLocIdxs[numPrefetchLocs - 1] through count - 1.
numPrefetchLocs
- Size of prefetchLocs and prefetchLocIdxs arrays.
flags
- Flags reserved for future use. Must be zero.
stream
- Stream to enqueue the batch operation in

Performs a batch of memory prefetches. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess otherwise the API will return an error.

The semantics of the individual prefetch operations are as described in cudaMemPrefetchAsync.

Performs memory prefetch on the address ranges specified in dptrs and sizes. Both arrays must be of the same length, as specified by count. Each memory range specified must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables, or it may also refer to system-allocated memory when all devices have a non-zero value for cudaDevAttrPageableMemoryAccess. The prefetch location for every operation in the batch is specified in the prefetchLocs array. Each entry in this array can apply to more than one operation. This is done by specifying, in the prefetchLocIdxs array, the index of the first prefetch operation that the corresponding entry in the prefetchLocs array applies to. Both prefetchLocs and prefetchLocIdxs must be of the same length, as specified by numPrefetchLocs. For example, if a batch has 10 prefetches listed in dptrs/sizes, the first 4 of which are to be prefetched to one location and the remaining 6 are to be prefetched to another, then numPrefetchLocs will be 2, prefetchLocIdxs will be {0, 4} and prefetchLocs will contain the two locations. Note that the first entry in prefetchLocIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less than count. Furthermore, numPrefetchLocs must be less than or equal to count.
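A sketch of the batching scheme, scaled down to four operations; the pointers, sizes, and helper name are illustrative, and all pointers are assumed to be managed allocations:

    #include <cuda_runtime.h>

    // Hedged sketch: batch of 4 prefetches, the first 2 to device 0 and the last 2
    // to host memory, so numPrefetchLocs is 2 and prefetchLocIdxs is {0, 2}.
    void batchPrefetch(void* a, void* b, void* c, void* d, size_t bytes, cudaStream_t stream)
    {
        void*  dptrs[4] = { a, b, c, d };
        size_t sizes[4] = { bytes, bytes, bytes, bytes };

        cudaMemLocation locs[2] = {};
        locs[0].type = cudaMemLocationTypeDevice; locs[0].id = 0;
        locs[1].type = cudaMemLocationTypeHost;

        size_t locIdxs[2] = { 0, 2 };   // locs[0] covers ops 0-1, locs[1] covers ops 2-3
        cudaMemPrefetchBatchAsync(dptrs, sizes, 4, locs, locIdxs, 2, 0, stream);
    }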

__host__ ​cudaError_t cudaMemRangeGetAttribute ( void* data, size_t dataSize, cudaMemRangeAttribute attribute, const void* devPtr, size_t count )

Query an attribute of a given memory range.

data
- A pointer to a memory location where the result of the attribute query will be written.
dataSize
- The size of data
attribute
- The attribute to query
devPtr
- Start of the range to query
count
- Size of the range to query

Query an attribute about the memory range starting at devPtr with a size of count bytes. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables.

The attribute parameter can take the following values:

See also:

cudaMemRangeGetAttributes, cudaMemPrefetchAsync, cudaMemAdvise, cuMemRangeGetAttribute

__host__ ​cudaError_t cudaMemRangeGetAttributes ( void** data, size_t* dataSizes, cudaMemRangeAttribute ** attributes, size_t numAttributes, const void* devPtr, size_t count )

Query attributes of a given memory range.

data
- A two-dimensional array containing pointers to memory locations where the result of each attribute query will be written to.
dataSizes
- Array containing the sizes of each result
attributes
- An array of attributes to query (numAttributes and the number of attributes in this array should match)
numAttributes
- Number of attributes to query
devPtr
- Start of the range to query
count
- Size of the range to query
__host__ ​cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )

Copies data between host and device.

dst
- Destination memory address
src
- Source memory address
count
- Size in bytes to copy
kind
- Type of transfer

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
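A minimal sketch of a host-to-device and device-to-host round trip with explicit copy kinds; the buffer size is illustrative:

    #include <cuda_runtime.h>
    #include <vector>

    // Hedged sketch: copy a host vector to the device and back.
    int main()
    {
        std::vector<float> host(1024, 1.0f);
        float* dev = nullptr;
        if (cudaMalloc(&dev, host.size() * sizeof(float)) != cudaSuccess) return 1;

        cudaMemcpy(dev, host.data(), host.size() * sizeof(float), cudaMemcpyHostToDevice);
        // ... run kernels on 'dev' ...
        cudaMemcpy(host.data(), dev, host.size() * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(dev);
        return 0;
    }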

Note:

See also:

cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpyDtoH, cuMemcpyHtoD, cuMemcpyDtoD, cuMemcpy

__host__ ​cudaError_t cudaMemcpy2D ( void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, cudaMemcpyKind kind )

Copies data between host and device.

dst
- Destination memory address
dpitch
- Pitch of destination memory
src
- Source memory address
spitch
- Pitch of source memory
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. dpitch and spitch are the widths in memory in bytes of the 2D arrays pointed to by dst and src, including any padding added to the end of each row. The memory areas may not overlap. width must not exceed either dpitch or spitch. Calling cudaMemcpy2D() with dst and src pointers that do not match the direction of the copy results in an undefined behavior. cudaMemcpy2D() returns an error if dpitch or spitch exceeds the maximum allowed.
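A minimal sketch of the pitched copy described above; the matrix dimensions are illustrative. The densely packed host matrix has a pitch equal to its row width in bytes:

    #include <cuda_runtime.h>
    #include <vector>

    // Hedged sketch: copy a packed 256-column x 128-row host matrix of floats
    // into a pitched device allocation, respecting both pitches.
    int main()
    {
        const size_t widthBytes = 256 * sizeof(float), height = 128;
        std::vector<float> host(256 * height, 0.0f);

        float* dev = nullptr;
        size_t devPitch = 0;
        if (cudaMallocPitch((void**)&dev, &devPitch, widthBytes, height) != cudaSuccess)
            return 1;

        cudaMemcpy2D(dev, devPitch,              // destination and its pitch
                     host.data(), widthBytes,    // source and its pitch (no padding)
                     widthBytes, height, cudaMemcpyHostToDevice);

        cudaFree(dev);
        return 0;
    }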

Note:

See also:

cudaMemcpy, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2D, cuMemcpy2DUnaligned

__host__ ​cudaError_t cudaMemcpy2DArrayToArray ( cudaArray_t dst, size_t wOffsetDst, size_t hOffsetDst, cudaArray_const_t src, size_t wOffsetSrc, size_t hOffsetSrc, size_t width, size_t height, cudaMemcpyKind kind = cudaMemcpyDeviceToDevice )

Copies data between host and device.

dst
- Destination memory address
wOffsetDst
- Destination starting X offset (columns in bytes)
hOffsetDst
- Destination starting Y offset (rows)
src
- Source memory address
wOffsetSrc
- Source starting X offset (columns in bytes)
hOffsetSrc
- Source starting Y offset (rows)
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at hOffsetSrc rows and wOffsetSrc bytes from the upper left corner to the CUDA array dst starting at hOffsetDst rows and wOffsetDst bytes from the upper left corner, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. wOffsetDst + width must not exceed the width of the CUDA array dst. wOffsetSrc + width must not exceed the width of the CUDA array src.

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2D, cuMemcpy2DUnaligned

__host__ ​ __device__ ​cudaError_t cudaMemcpy2DAsync ( void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data between host and device.

dst
- Destination memory address
dpitch
- Pitch of destination memory
src
- Source memory address
spitch
- Pitch of source memory
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer
stream
- Stream identifier

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. dpitch and spitch are the widths in memory in bytes of the 2D arrays pointed to by dst and src, including any padding added to the end of each row. The memory areas may not overlap. width must not exceed either dpitch or spitch.

Calling cudaMemcpy2DAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior. cudaMemcpy2DAsync() returns an error if dpitch or spitch is greater than the maximum allowed.

cudaMemcpy2DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

The device version of this function only handles device to device copies and cannot be given local or shared pointers.

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2DAsync

__host__ ​cudaError_t cudaMemcpy2DFromArray ( void* dst, size_t dpitch, cudaArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, cudaMemcpyKind kind )

Copies data between host and device.

dst
- Destination memory address
dpitch
- Pitch of destination memory
src
- Source memory address
wOffset
- Source starting X offset (columns in bytes)
hOffset
- Source starting Y offset (rows)
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at hOffset rows and wOffset bytes from the upper left corner to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. dpitch is the width in memory in bytes of the 2D array pointed to by dst, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array src. width must not exceed dpitch. cudaMemcpy2DFromArray() returns an error if dpitch exceeds the maximum allowed.

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2D, cuMemcpy2DUnaligned

__host__ ​cudaError_t cudaMemcpy2DFromArrayAsync ( void* dst, size_t dpitch, cudaArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data between host and device.

dst
- Destination memory address
dpitch
- Pitch of destination memory
src
- Source memory address
wOffset
- Source starting X offset (columns in bytes)
hOffset
- Source starting Y offset (rows)
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer
stream
- Stream identifier

Copies a matrix (height rows of width bytes each) from the CUDA array src starting at hOffset rows and wOffset bytes from the upper left corner to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. dpitch is the width in memory in bytes of the 2D array pointed to by dst, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array src. width must not exceed dpitch. cudaMemcpy2DFromArrayAsync() returns an error if dpitch exceeds the maximum allowed.

cudaMemcpy2DFromArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2DAsync

__host__ ​cudaError_t cudaMemcpy2DToArray ( cudaArray_t dst, size_t wOffset, size_t hOffset, const void* src, size_t spitch, size_t width, size_t height, cudaMemcpyKind kind )

Copies data between host and device.

dst
- Destination memory address
wOffset
- Destination starting X offset (columns in bytes)
hOffset
- Destination starting Y offset (rows)
src
- Source memory address
spitch
- Pitch of source memory
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the CUDA array dst starting at hOffset rows and wOffset bytes from the upper left corner, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. spitch is the width in memory in bytes of the 2D array pointed to by src, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array dst. width must not exceed spitch. cudaMemcpy2DToArray() returns an error if spitch exceeds the maximum allowed.

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2D, cuMemcpy2DUnaligned

__host__ ​cudaError_t cudaMemcpy2DToArrayAsync ( cudaArray_t dst, size_t wOffset, size_t hOffset, const void* src, size_t spitch, size_t width, size_t height, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data between host and device.

dst
- Destination memory address
wOffset
- Destination starting X offset (columns in bytes)
hOffset
- Destination starting Y offset (rows)
src
- Source memory address
spitch
- Pitch of source memory
width
- Width of matrix transfer (columns in bytes)
height
- Height of matrix transfer (rows)
kind
- Type of transfer
stream
- Stream identifier

Copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the CUDA array dst starting at hOffset rows and wOffset bytes from the upper left corner, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. spitch is the width in memory in bytes of the 2D array pointed to by src, including any padding added to the end of each row. wOffset + width must not exceed the width of the CUDA array dst. width must not exceed spitch. cudaMemcpy2DToArrayAsync() returns an error if spitch exceeds the maximum allowed.

cudaMemcpy2DToArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.
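
A hedged sketch of the asynchronous variant, using pinned host memory and a non-default stream so the copy can overlap with work in other streams (sizes and names are illustrative assumptions):

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray_t dstArray;
cudaMallocArray(&dstArray, &desc, 64, 64);

cudaStream_t stream;
cudaStreamCreate(&stream);

size_t width = 64 * sizeof(float), height = 64;
float* hostSrc;
cudaMallocHost((void**)&hostSrc, width * height);        // pinned host source

cudaMemcpy2DToArrayAsync(dstArray, 0, 0, hostSrc, width,
                         width, height, cudaMemcpyHostToDevice, stream);
// ... enqueue kernels that consume dstArray on the same stream ...
cudaStreamSynchronize(stream);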

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy2DAsync

__host__ ​cudaError_t cudaMemcpy3D ( const cudaMemcpy3DParms* p )

Copies data between 3D objects.

p
- 3D memory copy parameters
struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
struct cudaExtent make_cudaExtent(size_t w, size_t h, size_t d);

struct cudaPos {
    size_t x;
    size_t y;
    size_t z;
};
struct cudaPos make_cudaPos(size_t x, size_t y, size_t z);

struct cudaMemcpy3DParms {
    cudaArray_t            srcArray;
    struct cudaPos         srcPos;
    struct cudaPitchedPtr  srcPtr;
    cudaArray_t            dstArray;
    struct cudaPos         dstPos;
    struct cudaPitchedPtr  dstPtr;
    struct cudaExtent      extent;
    enum cudaMemcpyKind    kind;
};

cudaMemcpy3D() copies data between two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed are specified by the cudaMemcpy3DParms struct, which should be initialized to zero before use:

cudaMemcpy3DParms myParms = {0};

The struct passed to cudaMemcpy3D() must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Passing more than one non-zero source or destination will cause cudaMemcpy3D() to return an error.

The srcPos and dstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to be unsigned char.

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.

The kind field defines the direction of the copy. It must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. If kind is cudaMemcpyHostToHost, cudaMemcpyHostToDevice, or cudaMemcpyDeviceToHost and the source or destination is a cudaArray, any implication that the cudaArray resides on the host is disregarded: cudaMemcpy3D() silently corrects the kind, since a cudaArray can only reside on the device.

If the source and destination are both arrays, cudaMemcpy3D() will return an error if they do not have the same element size.

The source and destination objects may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.

The source object must entirely contain the region defined by srcPos and extent. The destination object must entirely contain the region defined by dstPos and extent.

cudaMemcpy3D() returns an error if the pitch of srcPtr or dstPtr exceeds the maximum allowed. The pitch of a cudaPitchedPtr allocated with cudaMalloc3D() will always be valid.
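
A minimal sketch putting this together (the 64-cubed float volume and the device-to-array direction are assumptions for illustration): it zero-initializes the struct as required, sets one source and one destination, and uses an extent in array elements because a CUDA array participates in the copy.

cudaExtent linExtent = make_cudaExtent(64 * sizeof(float), 64, 64); // width in bytes for linear memory

cudaPitchedPtr devPitched;
cudaMalloc3D(&devPitched, linExtent);                   // source: pitched device memory

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaExtent arrExtent = make_cudaExtent(64, 64, 64);     // width in elements for CUDA arrays
cudaArray_t dstArray;
cudaMalloc3DArray(&dstArray, &desc, arrExtent);

cudaMemcpy3DParms p = {0};                              // zero-initialize before use
p.srcPtr   = devPitched;
p.dstArray = dstArray;
p.extent   = arrExtent;                                 // in array elements, since an array participates
p.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&p);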

See also:

cudaMalloc3D, cudaMalloc3DArray, cudaMemset3D, cudaMemcpy3DAsync, cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, make_cudaExtent, make_cudaPos, cuMemcpy3D

__host__ ​ __device__ ​cudaError_t cudaMemcpy3DAsync ( const cudaMemcpy3DParms* p, cudaStream_t stream = 0 )

Copies data between 3D objects.

p
- 3D memory copy parameters
stream
- Stream identifier
struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
struct cudaExtent make_cudaExtent(size_t w, size_t h, size_t d);

struct cudaPos {
    size_t x;
    size_t y;
    size_t z;
};
struct cudaPos make_cudaPos(size_t x, size_t y, size_t z);

struct cudaMemcpy3DParms {
    cudaArray_t            srcArray;
    struct cudaPos         srcPos;
    struct cudaPitchedPtr  srcPtr;
    cudaArray_t            dstArray;
    struct cudaPos         dstPos;
    struct cudaPitchedPtr  dstPtr;
    struct cudaExtent      extent;
    enum cudaMemcpyKind    kind;
};

cudaMemcpy3DAsync() copies data between two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed are specified by the cudaMemcpy3DParms struct, which should be initialized to zero before use:

cudaMemcpy3DParms myParms = {0};

The struct passed to cudaMemcpy3DAsync() must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Passing more than one non-zero source or destination will cause cudaMemcpy3DAsync() to return an error.

The srcPos and dstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to be unsigned char. For CUDA arrays, positions must be in the range [0, 2048) for any dimension.

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.

The kind field defines the direction of the copy. It must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. If kind is cudaMemcpyHostToHost, cudaMemcpyHostToDevice, or cudaMemcpyDeviceToHost and the source or destination is a cudaArray, any implication that the cudaArray resides on the host is disregarded: cudaMemcpy3DAsync() silently corrects the kind, since a cudaArray can only reside on the device.

If the source and destination are both arrays, cudaMemcpy3DAsync() will return an error if they do not have the same element size.

The source and destination objects may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.

The source object must entirely contain the region defined by srcPos and extent. The destination object must entirely contain the region defined by dstPos and extent.

cudaMemcpy3DAsync() returns an error if the pitch of srcPtr or dstPtr exceeds the maximum allowed. The pitch of a cudaPitchedPtr allocated with cudaMalloc3D() will always be valid.

cudaMemcpy3DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

The device version of this function only handles device to device copies and cannot be given local or shared pointers.
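
A hedged sketch of a stream-ordered sub-region copy between two pitched device allocations (dimensions and offsets are illustrative assumptions); because no CUDA array is involved, positions and extents along x are expressed in bytes:

cudaExtent full = make_cudaExtent(128 * sizeof(float), 128, 128);
cudaPitchedPtr srcBuf, dstBuf;
cudaMalloc3D(&srcBuf, full);
cudaMalloc3D(&dstBuf, full);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpy3DParms p = {0};
p.srcPtr = srcBuf;
p.dstPtr = dstBuf;
p.srcPos = make_cudaPos(16 * sizeof(float), 16, 16);     // x offset in bytes for linear memory
p.dstPos = make_cudaPos(0, 0, 0);
p.extent = make_cudaExtent(32 * sizeof(float), 32, 32);  // 32 x 32 x 32 float sub-block
p.kind   = cudaMemcpyDeviceToDevice;
cudaMemcpy3DAsync(&p, stream);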

See also:

cudaMalloc3D, cudaMalloc3DArray, cudaMemset3D, cudaMemcpy3D, cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, make_cudaExtent, make_cudaPos, cuMemcpy3DAsync

__host__ ​cudaError_t cudaMemcpy3DBatchAsync ( size_t numOps, cudaMemcpy3DBatchOp* opList, unsigned long long flags, cudaStream_t stream )

Performs a batch of 3D memory copies asynchronously.

numOps
- Total number of memcpy operations.
opList
- Array of size numOps containing the actual memcpy operations.
flags
- Flags for future use, must be zero now.
stream
- Stream identifier

Performs a batch of memory copies. The batch as a whole executes in stream order but copies within a batch are not guaranteed to execute in any specific order. Note that this means specifying any dependent copies within a batch will result in undefined behavior.

Performs memory copies as specified in the opList array. The length of this array is specified in numOps. Each entry in this array describes a copy operation. This includes, among other things, the source and destination operands for the copy, as specified in cudaMemcpy3DBatchOp::src and cudaMemcpy3DBatchOp::dst respectively. The source and destination operands of a copy can each be either a pointer or a CUDA array. The width, height, and depth of a copy are specified in cudaMemcpy3DBatchOp::extent; they are expressed in elements and must not be zero. For pointer-to-pointer copies, the element size is considered to be 1. For pointer to CUDA array or vice versa copies, the element size is determined by the CUDA array. For CUDA array to CUDA array copies, the element size of the two CUDA arrays must match.

For a given operand, if cudaMemcpy3DOperand::type is specified as cudaMemcpyOperandTypePointer, then cudaMemcpy3DOperand::op::ptr will be used. The cudaMemcpy3DOperand::op::ptr::ptr field must contain the pointer where the copy should begin. The cudaMemcpy3DOperand::op::ptr::rowLength field specifies the length of each row in elements and must either be zero or be greater than or equal to the width of the copy specified in cudaMemcpy3DBatchOp::extent::width. The cudaMemcpy3DOperand::op::ptr::layerHeight field specifies the height of each layer and must either be zero or be greater than or equal to the height of the copy specified in cudaMemcpy3DBatchOp::extent::height. When either of these values is zero, that aspect of the operand is considered to be tightly packed according to the copy extent. For managed memory pointers on devices where cudaDevAttrConcurrentManagedAccess is true or system-allocated pageable memory on devices where cudaDevAttrPageableMemoryAccess is true, the cudaMemcpy3DOperand::op::ptr::locHint field can be used to hint the location of the operand.

If an operand's type is specified as cudaMemcpyOperandTypeArray, then cudaMemcpy3DOperand::op::array will be used. The cudaMemcpy3DOperand::op::array::array field specifies the CUDA array and cudaMemcpy3DOperand::op::array::offset specifies the 3D offset into that array where the copy begins.

The cudaMemcpyAttributes::srcAccessOrder indicates the source access ordering to be observed for copies associated with the attribute. If the source access order is set to cudaMemcpySrcAccessOrderStream, then the source will be accessed in stream order. If the source access order is set to cudaMemcpySrcAccessOrderDuringApiCall, then it indicates that access to the source pointer can be out of stream order and that all accesses must be complete before the API call returns. This flag is suited for ephemeral sources (e.g., stack variables) when it is known that no prior operations in the stream can be accessing the memory and that the lifetime of the memory is limited to the scope in which the source variable was declared. Specifying this flag allows the driver to optimize the copy and removes the need for the user to synchronize the stream after the API call. If the source access order is set to cudaMemcpySrcAccessOrderAny, then it indicates that access to the source pointer can be out of stream order and that the accesses can happen even after the API call returns. This flag is suited for host pointers allocated outside CUDA (e.g., via malloc) when it is known that no prior operations in the stream can be accessing the memory. Specifying this flag allows the driver to optimize the copy on certain platforms. Each memcpy operation in opList must have a valid srcAccessOrder setting, otherwise this API will return cudaErrorInvalidValue.

The cudaMemcpyAttributes::flags field can be used to specify certain flags for copies. Setting the cudaMemcpyFlagPreferOverlapWithCompute flag indicates that the associated copies should preferably overlap with any compute work. Note that this flag is a hint and can be ignored depending on the platform and other parameters of the copy.

Note:
__host__ ​cudaError_t cudaMemcpy3DPeer ( const cudaMemcpy3DPeerParms* p )

Copies memory between devices.

p
- Parameters for the memory copy
__host__ ​cudaError_t cudaMemcpy3DPeerAsync ( const cudaMemcpy3DPeerParms* p, cudaStream_t stream = 0 )

Copies memory between devices asynchronously.

p
- Parameters for the memory copy
stream
- Stream identifier
__host__ ​ __device__ ​cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data between host and device.

dst
- Destination memory address
src
- Source memory address
count
- Size in bytes to copy
kind
- Type of transfer
stream
- Stream identifier

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.

The memory areas may not overlap. Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

cudaMemcpyAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams.

The device version of this function only handles device to device copies and cannot be given local or shared pointers.
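
For reference, a minimal sketch of a stream-ordered host-to-device transfer; the buffer size and the commented-out kernel are illustrative assumptions. Pinned host memory is used so the copy can actually proceed asynchronously and overlap with other streams.

size_t count = 1 << 20;                                 // 1 MiB, illustrative
float* hostBuf;
float* devBuf;
cudaMallocHost((void**)&hostBuf, count);                // pinned host memory
cudaMalloc((void**)&devBuf, count);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(devBuf, hostBuf, count, cudaMemcpyHostToDevice, stream);
// myKernel<<<grid, block, 0, stream>>>(devBuf);        // hypothetical kernel consuming the data in stream order
cudaStreamSynchronize(stream);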

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpyAsync, cuMemcpyDtoHAsync, cuMemcpyHtoDAsync, cuMemcpyDtoDAsync

__host__ ​cudaError_t cudaMemcpyBatchAsync ( const void** dsts, const void** srcs, const size_t* sizes, size_t count, cudaMemcpyAttributes* attrs, size_t* attrsIdxs, size_t numAttrs, cudaStream_t stream )

Performs a batch of memory copies asynchronously.

dsts
- Array of destination pointers.
srcs
- Array of memcpy source pointers.
sizes
- Array of sizes for memcpy operations.
count
- Size of dsts, srcs and sizes arrays
attrs
- Array of memcpy attributes.
attrsIdxs
- Array of indices to specify which copies each entry in the attrs array applies to. The attributes specified in attrs[k] will be applied to copies starting from attrsIdxs[k] through attrsIdxs[k+1] - 1. Also attrs[numAttrs-1] will apply to copies starting from attrsIdxs[numAttrs-1] through count - 1.
numAttrs
- Size of attrs and attrsIdxs arrays.
stream
- Stream identifier

Performs a batch of memory copies. The batch as a whole executes in stream order but copies within a batch are not guaranteed to execute in any specific order. This API only supports pointer-to-pointer copies. For copies involving CUDA arrays, please see cudaMemcpy3DBatchAsync.

Performs memory copies from source buffers specified in srcs to destination buffers specified in dsts. The size of each copy is specified in sizes. All three arrays must be of the same length as specified by count. Since there are no ordering guarantees for copies within a batch, specifying any dependent copies within a batch will result in undefined behavior.

Every copy in the batch has to be associated with a set of attributes specified in the attrs array. Each entry in this array can apply to more than one copy. This is done by specifying, in the attrsIdxs array, the index of the first copy that the corresponding entry in the attrs array applies to. Both attrs and attrsIdxs must be of the same length as specified by numAttrs. For example, if a batch has 10 copies listed in dst/src/sizes, the first 6 of which have one set of attributes and the remaining 4 another, then numAttrs will be 2, attrsIdxs will be {0, 6} and attrs will contain the two sets of attributes. Note that the first entry in attrsIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less than count. Furthermore, numAttrs must be less than or equal to count.

The cudaMemcpyAttributes::srcAccessOrder indicates the source access ordering to be observed for copies associated with the attribute. If the source access order is set to cudaMemcpySrcAccessOrderStream, then the source will be accessed in stream order. If the source access order is set to cudaMemcpySrcAccessOrderDuringApiCall, then it indicates that access to the source pointer can be out of stream order and that all accesses must be complete before the API call returns. This flag is suited for ephemeral sources (e.g., stack variables) when it is known that no prior operations in the stream can be accessing the memory and that the lifetime of the memory is limited to the scope in which the source variable was declared. Specifying this flag allows the driver to optimize the copy and removes the need for the user to synchronize the stream after the API call. If the source access order is set to cudaMemcpySrcAccessOrderAny, then it indicates that access to the source pointer can be out of stream order and that the accesses can happen even after the API call returns. This flag is suited for host pointers allocated outside CUDA (e.g., via malloc) when it is known that no prior operations in the stream can be accessing the memory. Specifying this flag allows the driver to optimize the copy on certain platforms. Each memcpy operation in the batch must have a valid cudaMemcpyAttributes corresponding to it, including the appropriate srcAccessOrder setting, otherwise the API will return cudaErrorInvalidValue.

The cudaMemcpyAttributes::srcLocHint and cudaMemcpyAttributes::dstLocHint fields allow applications to specify hint locations for the operands of a copy when an operand does not have a fixed location. That is, these hints are only applicable for managed memory pointers on devices where cudaDevAttrConcurrentManagedAccess is true or system-allocated pageable memory on devices where cudaDevAttrPageableMemoryAccess is true. For other cases, these hints are ignored.

The cudaMemcpyAttributes::flags field can be used to specify certain flags for copies. Setting the cudaMemcpyFlagPreferOverlapWithCompute flag indicates that the associated copies should preferably overlap with any compute work. Note that this flag is a hint and can be ignored depending on the platform and other parameters of the copy.
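
To make the attrs/attrsIdxs mapping concrete, here is a hedged sketch of a two-copy batch sharing a single attribute set. It uses only the fields and enum values named above (srcAccessOrder, flags, cudaMemcpySrcAccessOrderStream), zero-initializes the rest of the attribute struct, and assumes a toolkit recent enough to provide this API; buffer names and sizes are illustrative.

void *dA, *dB;
cudaMalloc(&dA, 4096);
cudaMalloc(&dB, 8192);
float hA[1024], hB[2048];                               // host sources: 4096 and 8192 bytes

const void* dsts[2]  = { dA, dB };
const void* srcs[2]  = { hA, hB };
size_t      sizes[2] = { sizeof(hA), sizeof(hB) };

cudaMemcpyAttributes attr = {};                         // zero-initialize unused fields
attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;   // sources are accessed in stream order
attr.flags = 0;

size_t attrsIdxs[1] = { 0 };                            // this attribute set applies to copies 0..count-1

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyBatchAsync(dsts, srcs, sizes, 2, &attr, attrsIdxs, 1, stream);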

Note:
__host__ ​cudaError_t cudaMemcpyFromSymbol ( void* dst, const void* symbol, size_t count, size_t offset = 0, cudaMemcpyKind kind = cudaMemcpyDeviceToHost )

Copies data from the given symbol on the device.

dst
- Destination memory address
symbol
- Device symbol address
count
- Size in bytes to copy
offset
- Offset from start of symbol in bytes
kind
- Type of transfer

Copies count bytes from the memory area pointed to by offset bytes from the start of symbol symbol to the memory area pointed to by dst. The memory areas may not overlap. symbol is a variable that resides in global or constant memory space. kind can be either cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
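
As a minimal sketch (the __device__ variable below is hypothetical), reading a device-resident symbol back to the host from a .cu file:

__device__ float devResult[256];                        // hypothetical device-resident symbol

// Later, in host code: read the symbol's contents back.
float hostResult[256];
cudaMemcpyFromSymbol(hostResult, devResult, sizeof(hostResult), 0, cudaMemcpyDeviceToHost);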

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy, cuMemcpyDtoH, cuMemcpyDtoD

__host__ ​cudaError_t cudaMemcpyFromSymbolAsync ( void* dst, const void* symbol, size_t count, size_t offset, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data from the given symbol on the device.

dst
- Destination memory address
symbol
- Device symbol address
count
- Size in bytes to copy
offset
- Offset from start of symbol in bytes
kind
- Type of transfer
stream
- Stream identifier

Copies count bytes from the memory area pointed to by offset bytes from the start of symbol symbol to the memory area pointed to by dst. The memory areas may not overlap. symbol is a variable that resides in global or constant memory space. kind can be either cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.

cudaMemcpyFromSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cuMemcpyAsync, cuMemcpyDtoHAsync, cuMemcpyDtoDAsync

__host__ ​cudaError_t cudaMemcpyPeer ( void* dst, int  dstDevice, const void* src, int  srcDevice, size_t count )

Copies memory between two devices.

dst
- Destination device pointer
dstDevice
- Destination device
src
- Source device pointer
srcDevice
- Source device
count
- Size of memory copy in bytes

Copies memory from one device to memory on another device. dst is the base device pointer of the destination memory and dstDevice is the destination device. src is the base device pointer of the source memory and srcDevice is the source device. count specifies the number of bytes to copy.

Note that this function is asynchronous with respect to the host, but serialized with respect to all pending and future asynchronous work in the current device, srcDevice, and dstDevice (use cudaMemcpyPeerAsync to avoid this synchronization).
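
A minimal sketch, assuming at least two devices are present; the buffer size is illustrative:

size_t count = 1 << 20;
void *d0, *d1;

cudaSetDevice(0);
cudaMalloc(&d0, count);
cudaSetDevice(1);
cudaMalloc(&d1, count);

// Copy device 0's buffer into device 1's buffer.
// Enabling peer access (cudaDeviceEnablePeerAccess) beforehand may improve throughput
// on systems that support it, but is not required for correctness.
cudaMemcpyPeer(d1, 1, d0, 0, count);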

See also:

cudaMemcpy, cudaMemcpyAsync, cudaMemcpyPeerAsync, cudaMemcpy3DPeerAsync, cuMemcpyPeer

__host__ ​cudaError_t cudaMemcpyPeerAsync ( void* dst, int  dstDevice, const void* src, int  srcDevice, size_t count, cudaStream_t stream = 0 )

Copies memory between two devices asynchronously.

dst
- Destination device pointer
dstDevice
- Destination device
src
- Source device pointer
srcDevice
- Source device
count
- Size of memory copy in bytes
stream
- Stream identifier

Copies memory from one device to memory on another device. dst is the base device pointer of the destination memory and dstDevice is the destination device. src is the base device pointer of the source memory and srcDevice is the source device. count specifies the number of bytes to copy.

Note that this function is asynchronous with respect to the host and all work on other devices.

See also:

cudaMemcpy, cudaMemcpyPeer, cudaMemcpyAsync, cudaMemcpy3DPeerAsync, cuMemcpyPeerAsync

__host__ ​cudaError_t cudaMemcpyToSymbol ( const void* symbol, const void* src, size_t count, size_t offset = 0, cudaMemcpyKind kind = cudaMemcpyHostToDevice )

Copies data to the given symbol on the device.

symbol
- Device symbol address
src
- Source memory address
count
- Size in bytes to copy
offset
- Offset from start of symbol in bytes
kind
- Type of transfer

Copies count bytes from the memory area pointed to by src to the memory area pointed to by offset bytes from the start of symbol symbol. The memory areas may not overlap. symbol is a variable that resides in global or constant memory space. kind can be either cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
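
For illustration, a minimal sketch (the __constant__ variable is hypothetical) that uploads coefficients into constant memory before launching kernels that read them:

__constant__ float coeffs[16];                          // hypothetical constant-memory symbol

// In host code: upload new coefficients.
float hostCoeffs[16] = {};
cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs), 0, cudaMemcpyHostToDevice);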

Note:

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync, cuMemcpy, cuMemcpyHtoD, cuMemcpyDtoD

__host__ ​cudaError_t cudaMemcpyToSymbolAsync ( const void* symbol, const void* src, size_t count, size_t offset, cudaMemcpyKind kind, cudaStream_t stream = 0 )

Copies data to the given symbol on the device.

symbol
- Device symbol address
src
- Source memory address
count
- Size in bytes to copy
offset
- Offset from start of symbol in bytes
kind
- Type of transfer
stream
- Stream identifier

Copies count bytes from the memory area pointed to by src to the memory area pointed to by offset bytes from the start of symbol symbol. The memory areas may not overlap. symbol is a variable that resides in global or constant memory space. kind can be either cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.

cudaMemcpyToSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice and stream is non-zero, the copy may overlap with operations in other streams.

See also:

cudaMemcpy, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpy2DToArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyFromSymbolAsync, cuMemcpyAsync, cuMemcpyHtoDAsync, cuMemcpyDtoDAsync

__host__ ​cudaError_t cudaMemset ( void* devPtr, int  value, size_t count )

Initializes or sets device memory to a value.

devPtr
- Pointer to device memory
value
- Value to set for each byte of specified memory
count
- Size in bytes to set

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

Note that this function is asynchronous with respect to the host unless devPtr refers to pinned host memory.
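
A minimal sketch (allocation size is illustrative):

size_t count = 1024;
unsigned char* devBuf;
cudaMalloc((void**)&devBuf, count);
cudaMemset(devBuf, 0, count);                           // zero the whole allocation
cudaMemset(devBuf, 0xFF, 16);                           // set only the first 16 bytes to 0xFF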

See also:

cuMemsetD8, cuMemsetD16, cuMemsetD32

__host__ ​cudaError_t cudaMemset2D ( void* devPtr, size_t pitch, int  value, size_t width, size_t height )

Initializes or sets device memory to a value.

devPtr
- Pointer to 2D device memory
pitch
- Pitch in bytes of 2D device memory (unused if height is 1)
value
- Value to set for each byte of specified memory
width
- Width of matrix set (columns in bytes)
height
- Height of matrix set (rows)
__host__ ​ __device__ ​cudaError_t cudaMemset2DAsync ( void* devPtr, size_t pitch, int  value, size_t width, size_t height, cudaStream_t stream = 0 )

Initializes or sets device memory to a value.

devPtr
- Pointer to 2D device memory
pitch
- Pitch in bytes of 2D device memory (unused if height is 1)
value
- Value to set for each byte of specified memory
width
- Width of matrix set (columns in bytes)
height
- Height of matrix set (rows)
stream
- Stream identifier

Sets to the specified value value a matrix (height rows of width bytes each) pointed to by devPtr. pitch is the width in bytes of the 2D array pointed to by devPtr, including any padding added to the end of each row. This function performs fastest when the pitch is one that has been passed back by cudaMallocPitch().

cudaMemset2DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

The device version of this function only handles device to device copies and cannot be given local or shared pointers.
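
A minimal sketch pairing this call with cudaMallocPitch (dimensions are illustrative); only the logical width of each row is cleared, leaving any padding bytes untouched:

size_t width = 512, height = 256;                       // width in bytes, height in rows
size_t pitch;
unsigned char* devBuf;
cudaMallocPitch((void**)&devBuf, &pitch, width, height);

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemset2DAsync(devBuf, pitch, 0, width, height, stream);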

See also:

cudaMemset, cudaMemset2D, cudaMemset3D, cudaMemsetAsync, cudaMemset3DAsync, cuMemsetD2D8Async, cuMemsetD2D16Async, cuMemsetD2D32Async

__host__ ​cudaError_t cudaMemset3D ( cudaPitchedPtr pitchedDevPtr, int  value, cudaExtent extent )

Initializes or sets device memory to a value.

pitchedDevPtr
- Pointer to pitched device memory
value
- Value to set for each byte of specified memory
extent
- Size parameters for where to set device memory (width field in bytes)

Initializes each element of a 3D array to the specified value value. The object to initialize is defined by pitchedDevPtr. The pitch field of pitchedDevPtr is the width in memory in bytes of the 3D array pointed to by pitchedDevPtr, including any padding added to the end of each row. The xsize field specifies the logical width of each row in bytes, while the ysize field specifies the height of each 2D slice in rows. The pitch field of pitchedDevPtr is ignored when height and depth are both equal to 1.

The extents of the initialized region are specified as a width in bytes, a height in rows, and a depth in slices.

Extents with width greater than or equal to the xsize of pitchedDevPtr may perform significantly faster than extents narrower than the xsize. Secondarily, extents with height equal to the ysize of pitchedDevPtr will perform faster than when the height is shorter than the ysize.

This function performs fastest when the pitchedDevPtr has been allocated by cudaMalloc3D().

Note that this function is asynchronous with respect to the host unless pitchedDevPtr refers to pinned host memory.
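
A minimal sketch using cudaMalloc3D and make_cudaExtent (the extent values are illustrative):

cudaExtent extent = make_cudaExtent(256, 64, 32);       // width in bytes, height in rows, depth in slices
cudaPitchedPtr vol;
cudaMalloc3D(&vol, extent);
cudaMemset3D(vol, 0, extent);                           // zero the whole 3D allocation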

See also:

cudaMemset, cudaMemset2D, cudaMemsetAsync, cudaMemset2DAsync, cudaMemset3DAsync, cudaMalloc3D, make_cudaPitchedPtr, make_cudaExtent

__host__ ​ __device__ ​cudaError_t cudaMemset3DAsync ( cudaPitchedPtr pitchedDevPtr, int  value, cudaExtent extent, cudaStream_t stream = 0 )

Initializes or sets device memory to a value.

pitchedDevPtr
- Pointer to pitched device memory
value
- Value to set for each byte of specified memory
extent
- Size parameters for where to set device memory (width field in bytes)
stream
- Stream identifier

Initializes each element of a 3D array to the specified value value. The object to initialize is defined by pitchedDevPtr. The pitch field of pitchedDevPtr is the width in memory in bytes of the 3D array pointed to by pitchedDevPtr, including any padding added to the end of each row. The xsize field specifies the logical width of each row in bytes, while the ysize field specifies the height of each 2D slice in rows. The pitch field of pitchedDevPtr is ignored when height and depth are both equal to 1.

The extents of the initialized region are specified as a width in bytes, a height in rows, and a depth in slices.

Extents with width greater than or equal to the xsize of pitchedDevPtr may perform significantly faster than extents narrower than the xsize. Secondarily, extents with height equal to the ysize of pitchedDevPtr will perform faster than when the height is shorter than the ysize.

This function performs fastest when the pitchedDevPtr has been allocated by cudaMalloc3D().

cudaMemset3DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

The device version of this function only handles device to device copies and cannot be given local or shared pointers.

See also:

cudaMemset, cudaMemset2D, cudaMemset3D, cudaMemsetAsync, cudaMemset2DAsync, cudaMalloc3D, make_cudaPitchedPtr, make_cudaExtent

__host__ ​ __device__ ​cudaError_t cudaMemsetAsync ( void* devPtr, int  value, size_t count, cudaStream_t stream = 0 )

Initializes or sets device memory to a value.

devPtr
- Pointer to device memory
value
- Value to set for each byte of specified memory
count
- Size in bytes to set
stream
- Stream identifier
__host__ ​cudaError_t cudaMipmappedArrayGetMemoryRequirements ( cudaArrayMemoryRequirements* memoryRequirements, cudaMipmappedArray_t mipmap, int  device )

Returns the memory requirements of a CUDA mipmapped array.

memoryRequirements
- Pointer to cudaArrayMemoryRequirements
mipmap
- CUDA mipmapped array to get the memory requirements of
device
- Device to get the memory requirements for
__host__ ​cudaError_t cudaMipmappedArrayGetSparseProperties ( cudaArraySparseProperties* sparseProperties, cudaMipmappedArray_t mipmap )

Returns the layout properties of a sparse CUDA mipmapped array.

sparseProperties
- Pointer to return cudaArraySparseProperties
mipmap
- The CUDA mipmapped array to get the sparse properties of
__host__ ​cudaExtent make_cudaExtent ( size_t w, size_t h, size_t d )

Returns a cudaExtent based on input parameters.

w
- Width in elements when referring to array memory, in bytes when referring to linear memory
h
- Height in elements
d
- Depth in elements
__host__ ​cudaPitchedPtr make_cudaPitchedPtr ( void* d, size_t p, size_t xsz, size_t ysz )

Returns a cudaPitchedPtr based on input parameters.

d
- Pointer to allocated memory
p
- Pitch of allocated memory in bytes
xsz
- Logical width of allocation in elements
ysz
- Logical height of allocation in elements
__host__ ​cudaPos make_cudaPos ( size_t x, size_t y, size_t z )

Returns a cudaPos based on input parameters.

x
- X position
y
- Y position
z
- Z position
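
To show how these helpers fit together, a hedged sketch (volume size and element type are assumptions) that wraps a tightly packed host volume with make_cudaPitchedPtr and copies it into a 3D CUDA array:

// Wrap a tightly packed host volume (64 x 64 x 64 floats) as a pitched pointer.
static float hostVol[64 * 64 * 64];
cudaPitchedPtr hostPtr = make_cudaPitchedPtr(hostVol, 64 * sizeof(float), 64, 64);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray_t dstArray;
cudaMalloc3DArray(&dstArray, &desc, make_cudaExtent(64, 64, 64));

cudaMemcpy3DParms p = {0};
p.srcPtr   = hostPtr;
p.srcPos   = make_cudaPos(0, 0, 0);
p.dstArray = dstArray;
p.extent   = make_cudaExtent(64, 64, 64);               // in array elements because a CUDA array is involved
p.kind     = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);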
