Kokkos::Graph is an abstraction that describes asynchronous workloads organised as a directed acyclic graph (DAG).
Once defined, the graph can be executed many times.
Kokkos::Graph is specialized for some backends:

Cuda
HIP
SYCL

On these backends, the Kokkos::Graph specialisations map to the native graph API, namely the CUDA Graph API, the HIP Graph API, and the SYCL (command) Graph API, respectively. For other backends, Kokkos::Graph provides a defaulted implementation.
Workloads submitted to Kokkos execution space instances execute eagerly, i.e., once the Kokkos::parallel_ function is called, the workload is immediately launched on the device. By contrast, the Kokkos::Graph abstraction follows lazy execution, i.e., workloads added to a Kokkos::Graph are not executed until the whole graph is ready and submitted.
Typically, 3 phases are needed:
definition
instantiation
submission
The definition phase consists of describing the workloads: what they do, as well as their dependencies. In other words, this phase creates a topological graph of workloads.
The instantiation phase locks the topology, i.e., it cannot be changed anymore. During this phase, the graph will be checked for flaws. The backend creates an executable graph.
The last phase is submission. It will execute the workloads, observing their dependencies. This phase can be run multiple times.
16.1.3. Advantages

There are many advantages. Here are a few:
Since the workloads are described ahead of execution, the backend driver and/or compiler can leverage optimization opportunities.
Launch overhead is reduced, benefitting DAGs consisting of small workloads.
Some use cases might require adding nodes to a Kokkos::Graph with workloads that aren't expressed in terms of the Kokkos API but rather in native code, e.g., calling external math libraries like cuBLAS. Such scenarios arise in many situations, such as building and training a neural network, running a conjugate gradient method, and so on.
Capturing into a Kokkos::Graph boils down to writing the following snippet:

struct MyCudaCapture {
    ViewType data;
    void operator()(const Kokkos::Cuda& exec) const { ... }
};
...
auto my_captured_node = predecessor.cuda_capture(
    exec, MyCudaCapture{.data = my_data}
);
When the node is added to the Kokkos::Graph, the workloads are not directly dispatched to the device. Rather, the backend operations are “saved” for later “reuse” in the capture node.
Some important aspects of capture are worth pointing out:
The function object will be stored by the Kokkos::Graph instance, thereby ensuring that any data bound to the function object stays alive until the graph is destroyed.
The execution space instance exec associates the captured workloads with a device.
While in “capture mode”, backend-specific restrictions may apply (see the CUDA programming guide, for instance).
Warning
When a “stream” is used by multiple threads, capturing on one thread may affect other threads (search for cudaThreadExchangeStreamCaptureMode in the CUDA runtime API documentation, for instance).
For now, capture is only supported for the following backends:
Note
The SYCL documentation uses the term recording instead of capture, but it is essentially the same thing.
Consider a diamond-like DAG. The following snippet defines, instantiates, and submits a Kokkos::Graph for this DAG.
auto graph = Kokkos::create_graph([&](auto root) {
    auto node_A = root.then_parallel_for("workload A", ...policy..., ...functor...);
    auto node_B = node_A.then_parallel_for("workload B", ...policy..., ...functor...);
    auto node_C = node_A.then_parallel_for("workload C", ...policy..., ...functor...);
    auto node_D = Kokkos::when_all(node_B, node_C)
                      .then_parallel_for("workload D", ...policy..., ...functor...);
});
graph.instantiate();
graph.submit();

16.2.2. Capture of a cuBLAS call
This example shows how to create a node that captures a cuBLAS call. It also demonstrates how data is kept alive during the whole lifetime of the Kokkos::Graph (e.g. the cuBLAS handle).
#include "cublas_v2.h"

#include "Kokkos_Core.hpp"
#include "Kokkos_Graph.hpp"

#define ASSERT_EQ(a, b) \
    if (a != b) throw std::runtime_error("There was a problem: " #a " is not equal to " #b);

#define CHECK_CUBLAS_CALL(call)                               \
    {                                                         \
        const auto error_code = call;                         \
        if (error_code != CUBLAS_STATUS_SUCCESS)              \
        {                                                     \
            printf("%s:%d: failure of statement %s: %s (%d)\n", \
                   __FILE__, __LINE__,                        \
                   #call,                                     \
                   cublasGetStatusName(error_code), error_code); \
            std::abort();                                     \
        }                                                     \
    }

template <typename MatrixType, typename VectorType>
void initialize_system(const Kokkos::Cuda& exec, const MatrixType& matrix, const VectorType& vector)
{
    Kokkos::parallel_for(
        Kokkos::RangePolicy(exec, 0, matrix.extent(0)),
        KOKKOS_LAMBDA(const Kokkos::Cuda::size_type irow) {
            matrix(irow, 0) = 2 * irow + 1;
            matrix(irow, 1) = 2 * irow + 2;
            vector(irow)    = irow + 5;
        }
    );
}

auto create_cublas_handle()
{
    cublasHandle_t handle = nullptr;
    CHECK_CUBLAS_CALL(cublasCreate(&handle));
    return std::shared_ptr<cublasContext>(handle, [](cublasHandle_t ptr) {
        CHECK_CUBLAS_CALL(cublasDestroy(ptr));
    });
}

template <typename Exec, typename Predecessor, typename MatrixType, typename VectorType, typename ResultType>
auto gemv(const Exec& exec, const Predecessor& predecessor,
          const MatrixType& matrix, const VectorType& vector, const ResultType& result)
{
    static_assert(std::is_same_v<typename MatrixType::value_type, double>);
    static_assert(std::is_same_v<typename VectorType::value_type, double>);
    static_assert(std::is_same_v<typename ResultType::value_type, double>);

    auto handle = create_cublas_handle();

    const double alpha = 1., beta = 1.;

    /// The @c handle is a shared resource stored in the lambda.
    /// Since the lambda is stored by the node, and the node won't be destroyed
    /// until the @c Kokkos::Graph is destroyed, the @c cuBLAS handle is
    /// guaranteed to stay alive for the whole graph lifetime.
    return predecessor.cuda_capture(
        exec,
        [=](const Kokkos::Cuda& exec_) {
            CHECK_CUBLAS_CALL(cublasSetStream(handle.get(), exec_.cuda_stream()));
            CHECK_CUBLAS_CALL(cublasDgemv(
                handle.get(), CUBLAS_OP_N,
                vector.size(), result.size(),
                &alpha,
                matrix.data(), vector.size(),
                vector.data(), 1,
                &beta,
                result.data(), 1
            ));
        }
    );
}

void test_capture()
{
    constexpr size_t nrows = 2, ncols = 2;

    using value_t    = double;
    using matrix_t   = Kokkos::View<value_t[nrows][ncols], Kokkos::LayoutRight, Kokkos::CudaSpace>;
    using vector_d_t = Kokkos::View<value_t[nrows], Kokkos::CudaSpace>;
    using vector_s_t = Kokkos::View<value_t[nrows], Kokkos::SharedSpace>;

    const Kokkos::Cuda exec{};

    const matrix_t   matrix(Kokkos::view_alloc(Kokkos::WithoutInitializing, exec, "matrix"));
    const vector_d_t vector(Kokkos::view_alloc(Kokkos::WithoutInitializing, exec, "vector"));
    const vector_s_t result(Kokkos::view_alloc(exec, "result"));

    initialize_system(exec, matrix, vector);

    auto graph = Kokkos::Experimental::create_graph([&](const auto& root) {
        auto node_gemv = gemv(exec, root, matrix, vector, result);
    });
    graph.instantiate();

    //! The views are stored in the graph node. No kernel ran yet.
    ASSERT_EQ(matrix.use_count(), 2);
    ASSERT_EQ(vector.use_count(), 2);
    ASSERT_EQ(result.use_count(), 2);
    ASSERT_EQ(result(0), 0);
    ASSERT_EQ(result(1), 0);

    //! Let's submit the graph twice, to ensure that the captured node behaves well.
    graph.submit(exec);
    Kokkos::deep_copy(exec, vector, result);
    graph.submit(exec);

    exec.fence();

    ASSERT_EQ(result(0), 23 + 125);
    ASSERT_EQ(result(1), 34 + 182);
}

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    test_capture();
    Kokkos::finalize();
    return EXIT_SUCCESS;
}