GPU-accelerated computing, introduced in the 2000s and instrumental in starting the AI revolution in 2011, is one of the main trends nowadays. The performance of GPUs and dedicated AI accelerators (called NPUs (neural processing units), TPUs (tensor processing units), etc.) increases significantly faster than the performance of CPUs. These GPUs/NPUs are now equipped with special instructions and extended capabilities to run various sophisticated algorithms.
Until ~2012 OpenCV was a purely CPU library, even though special optimizations using parallel loops and vector instructions had been actively added. That CPU-based acceleration direction is still relevant, see #25019. Then we introduced CUDA-based acceleration modules in OpenCV, which have since been moved to opencv_contrib. In OpenCV 3.0 we also introduced the OpenCL-based Transparent API (T-API).
Besides using CUDA and OpenCL to accelerate basic functionality, we also added CUDA- and OpenCL-based backends to our deep learning inference module (OpenCV DNN), introduced in 2015. OpenCV DNN also includes backends built on other standard or proprietary acceleration APIs: a Vulkan-based backend, a CANN-based backend for Huawei Ascend, a TimVX/OpenVX-based backend for the Amlogic NPU, etc.
There are several serious problems with the current approach that we want to solve in OpenCV 5.0, namely: each acceleration backend brings its own data structure (UMat for OpenCL, GpuMat for CUDA, AclMat for CANN, etc.), which is very inconvenient for OpenCV itself and for user applications.

The proposed design includes the following elements.

Operations:
Operations will be derived from the same base class (cv::hal::BaseOp, for example) and will have to implement a certain protocol: shape & type inference, evaluation of the scratch buffer size, asynchronous execution, etc.
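For illustration, here is a minimal sketch of what such a base class could look like; all names, types and signatures below are illustrative assumptions rather than a final API:

```cpp
#include <cstddef>
#include <vector>
#include <opencv2/core.hpp>

namespace cv { namespace hal {

class Stream;                       // placeholder: execution stream/queue (see "Streams" below)
typedef std::vector<int> MatShape;  // placeholder shape type used only in this sketch

// Hypothetical sketch of the common operation protocol, not a final API.
class BaseOp
{
public:
    virtual ~BaseOp() {}

    // shape & type inference: derive output shapes/types from the inputs
    virtual void inferOutputs(const std::vector<MatShape>& inShapes,
                              const std::vector<int>& inTypes,
                              std::vector<MatShape>& outShapes,
                              std::vector<int>& outTypes) const = 0;

    // evaluation of the scratch buffer size needed for the given inputs
    virtual size_t scratchBufferSize(const std::vector<MatShape>& inShapes,
                                     const std::vector<int>& inTypes) const = 0;

    // schedule the operation for (possibly asynchronous) execution in the given stream
    virtual void enqueue(Stream& stream,
                         const std::vector<UMat>& inputs,
                         std::vector<UMat>& outputs,
                         UMat& scratch) = 0;
};

}} // namespace cv::hal
```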
Data structures:
cv::UMat will be the universal data type. Previously, UMat instances were all wrappers on top of OpenCL buffers when an OpenCL runtime engine is detected, or they were all wrappers on top of system memory buffers when no OpenCL is used. In OpenCV 5.0, different instances of UMat may be allocated and handled by different backends, for example:
```cpp
using namespace cv;

// capture a video frame; it may be stored in an OpenGL texture
// (perhaps we don't even need to know the exact representation)
VideoCapture vidcap(0);
UMat frame;
vidcap >> frame;
// convert the frame from OpenGL (or whatever the current representation is)
// to a CUDA buffer on the 0-th NVidia GPU installed in the system.
// in the general case, transferring data from one memory space to another will be done via system memory,
// but some backends will provide more efficient mechanisms
UMat cuda_frame = frame.upload(Device::NVGPU_(0));
UMat cuda_processed_frame;
// filter the frame, result will be placed onto the same device
GaussianBlur(cuda_frame, cuda_processed_frame, Size(11, 11), 3, 3);
// retrieve the result as cv::Mat for further custom processing
Mat result = cuda_processed_frame.getMat(Access::READ_WRITE);
...
```
Devices, Allocators:
There will be a Device class, and some basic non-CPU HAL functions and classes, including memory allocators, will take Device* as a parameter. nullptr could be used as the equivalent of 'CPU device'. For other cases there will be helper functions that return the proper pointers, e.g. Device::NVGPU_(device_index), Device::GPU_(device_index) (any GPU), Device::defaultAccelerator() (any GPU or NPU or CPU), etc. A UMat instance could return the device where it's located.
Memory management will be done through UMatAllocator. It can allocate memory on the specified device, deallocate memory, upload a memory block to the specified device, download a memory block to system memory, transfer memory to another device (via an intermediate download() or directly, if possible), copy memory within the same device, initialize a memory block with the specified scalar value, and map/unmap memory to/from system memory if the device supports zero copy (if not, a physical copy is made).
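To make this list more concrete, here is a rough sketch of what such an allocator interface could look like; all names, signatures and the Device type here are illustrative assumptions, not a final API:

```cpp
#include <cstddef>
#include <opencv2/core.hpp>

namespace cv { namespace hal {

class Device;   // placeholder for the device abstraction described above

// Hypothetical sketch of a per-backend allocator, not a final API.
class UMatAllocator
{
public:
    virtual ~UMatAllocator() {}

    // allocate/deallocate memory on the specified device
    virtual void* allocate(Device* device, size_t size) = 0;
    virtual void deallocate(Device* device, void* ptr) = 0;

    // upload a block from system memory to the specified device,
    // download a block from the device back to system memory
    virtual void upload(Device* device, void* dst, const void* src, size_t size) = 0;
    virtual void download(Device* device, void* dst, const void* src, size_t size) = 0;

    // transfer a block to another device (via an intermediate download()
    // or directly, if possible); copy a block within the same device
    virtual void transfer(Device* srcDevice, const void* src,
                          Device* dstDevice, void* dst, size_t size) = 0;
    virtual void copy(Device* device, void* dst, const void* src, size_t size) = 0;

    // initialize a memory block with the specified scalar value
    virtual void setTo(Device* device, void* dst, size_t size, const Scalar& value) = 0;

    // map/unmap device memory to/from system memory if the device supports
    // zero copy; otherwise a physical copy is made
    virtual void* map(Device* device, void* ptr, size_t size) = 0;
    virtual void unmap(Device* device, void* ptr) = 0;
};

}} // namespace cv::hal
```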
Streams:
If possible, acceleration backends should be able to run operations asynchronously, as it is done now in the Transparent API. That is, GaussianBlur() in the example above, and most other functions, return control back to the user once an operation is scheduled for execution. Operations that return memory buffers in system memory (like UMat::getMat()) shall insert a synchronization point before returning the buffer.
Typically, asynchronous execution is done using streams, a.k.a. queues in OpenCL terminology. A stream/queue holds tasks that one wants to execute sequentially. Note that in the proposed API (just like in the Transparent API) users don't have to deal with streams directly.
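A minimal sketch of such a stream/queue interface could look as follows (again, the names and signatures are illustrative assumptions):

```cpp
#include <vector>
#include <opencv2/core.hpp>

namespace cv { namespace hal {

class BaseOp;   // the common operation protocol sketched earlier

// Hypothetical stream/queue interface, not a final API;
// user code normally never touches it directly.
class Stream
{
public:
    virtual ~Stream() {}

    // put an operation into the queue and return immediately
    virtual void enqueue(BaseOp* op,
                         const std::vector<UMat>& inputs,
                         std::vector<UMat>& outputs) = 0;

    // insert a synchronization point: block until everything enqueued
    // so far has finished (used, e.g., by UMat::getMat())
    virtual void finish() = 0;
};

}} // namespace cv::hal
```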
Different subsequent operations may reuse the same scratch buffer. If a scratch buffer needs to be reallocated (increased), a synchronization point must be inserted to make sure that all previous operations that may use the same scratch buffer have finished.
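For example, the scratch buffer handling inside a backend could look roughly like the following sketch, which builds on the hypothetical Stream interface above (ensureScratchBuffer is an illustrative name, not an existing function):

```cpp
// Grow the shared scratch buffer if the next operation needs more space.
static void ensureScratchBuffer(cv::hal::Stream& stream, cv::UMat& scratch, size_t requiredBytes)
{
    size_t currentBytes = scratch.empty() ? 0 : scratch.total() * scratch.elemSize();
    if (currentBytes < requiredBytes)
    {
        // previously scheduled operations may still be using the old buffer,
        // so insert a synchronization point before reallocating it
        stream.finish();
        scratch.create(1, (int)requiredBytes, CV_8U);
    }
}
```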
Temporary UMats need to be protected, since some complex functions may use temporary UMats:
```cpp
void foo(const UMat& input, UMat& output) {
    UMat temp1, temp2;
    op1(input, temp1);
    op2(temp1, temp2);
    op3(temp2, output);
}
```
Since all the operations op* may be asynchronous, foo() may return before all of the op* calls have finished, so temp1 and temp2, being local variables, may be destructed while the operations are still running or waiting for execution in a stream. To protect them from premature release, each backend that provides asynchronous execution must increment the reference counters of all UMat arguments of each operation put into the stream, catch the event when each operation finishes, and then decrement the reference counters back. This mechanism is already implemented in the Transparent API.
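A possible sketch of this protection mechanism (illustrative names only, not the actual T-API implementation):

```cpp
#include <vector>
#include <opencv2/core.hpp>

// Keep the UMat arguments of a scheduled operation alive until the backend
// reports that the operation has finished.
struct ScheduledOpRecord
{
    // holding the UMat's by value increments the reference counters
    // of the underlying buffers
    std::vector<cv::UMat> args;
};

// pseudo-usage inside a backend:
//   auto* rec = new ScheduledOpRecord{ { input, temp1, temp2, output } };
//   ... enqueue the operation and register a completion callback that does: delete rec;
// deleting the record decrements the reference counters back, so temp1/temp2
// can only be released after the operation has actually finished
```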
There should probably be a special 'graph mode' flag for fast, light-weight scheduling of operations without extra protection of temporary matrices. This will be useful for OpenCV DNN backends.
We want to ensure that different backends can be built separately from OpenCV and then dynamically connected to it. In order to do that, the non-CPU HAL API should be put into a separate header that will not use C++ classes (well, there can be pure abstract C++ classes, pointers to which are returned by extern "C" functions).
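For example, a dynamically loadable backend could expose a single well-known entry point along these lines (the entry point name and the Backend interface are assumptions for illustration):

```cpp
// hypothetical plugin header; all names are illustrative
namespace cv { namespace hal {
class Backend;   // pure abstract C++ class: virtual methods only, no data members
}}

extern "C"
{
    // the entry point exported by each acceleration plugin; OpenCV loads the
    // shared library at run time, resolves this symbol and checks the API version
    cv::hal::Backend* cvHalGetBackend(int requested_api_version);
}
```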
Some acceleration backends, like OpenCL or Vulkan, support JIT compilation, which adds a lot of flexibility and more opportunities for efficient graph fusion, and also lets us decrease the footprint dramatically (we don't have to put all precompiled kernels into the binary; we may ask the backend to compile the necessary kernels on the fly and store them in an on-disk cache). There will be a dedicated API to allow such on-the-fly compilation.
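A dedicated compilation interface could be sketched roughly as follows (illustrative names and signatures, not a final API):

```cpp
#include <string>

namespace cv { namespace hal {

class Kernel;   // placeholder: handle to a compiled kernel

// Hypothetical on-the-fly compilation interface, not a final API.
class JITCompiler
{
public:
    virtual ~JITCompiler() {}

    // compile kernel source (OpenCL C, GLSL/SPIR-V, ...) with the given build options;
    // an implementation may first look the kernel up in an on-disk cache
    virtual Kernel* compile(const std::string& source,
                            const std::string& buildOptions) = 0;

    // where compiled binaries are cached so that each kernel is built only once
    virtual void setCacheDirectory(const std::string& path) = 0;
};

}} // namespace cv::hal
```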
All OpenCV functions for which there are corresponding non-CPU HAL entries, e.g. cv::gemm(), shall be modified to use the generalized dispatcher scheme, something like:
```cpp
void cv::foo(InputArray a, InputArray b, OutputArray c, const FooParams& params)
{
    c.fitSameMemory(a, a.size(), a.type());
    // get the default backend for the device where 'a' is located
    hal::Backend* backend = a.backend();
    if (backend && backend->supports(CV_HAL_ID_FOO)) {
        CV_Assert(a.sameMemorySpace(b)); // some functions may support mixed-space ops as well
        // retrieve the kernel pointer for the concrete set of input parameters.
        // Backends with JIT support may generate such a kernel on the fly.
        // OpenCV DNN, other OpenCV components or user applications with a 'state' may
        // retrieve those kernels once and store them for faster access.
        auto hal_foo_ptr = backend->get_foo_kernel(a.type(), b.type(), params);
        if (hal_foo_ptr && backend->run(hal_foo_ptr, {a, b, c}, params))
            return;
    }
    // fallback to CPU
    auto a_ = a.getMat(), b_ = b.getMat(), c_ = c.getMat();
    ...
}
```
For OpenCV 5.0 the minimum plan is to introduce the non-CPU HAL API, probably as a draft specification, and to implement at least one backend, most probably OpenCL. After that, in 5.x, we can create more backends, for example a CUDA backend.