Kokkos::Graph is an abstraction that describes asynchronous workloads organised as a directed acyclic graph (DAG).
Once defined, the graph can be executed many times.
Kokkos::Graph is specialized for some backends:

Cuda
HIP
SYCL

On these backends, the Kokkos::Graph specialisations map to the native graph API, namely the CUDA Graph API, the HIP Graph API, and the SYCL (command) Graph API, respectively. For other backends, Kokkos::Graph provides a defaulted implementation.
Workloads submitted to Kokkos execution space instances execute eagerly, i.e., once the Kokkos::parallel_ function is called, the workload is immediately launched on the device. By contrast, the Kokkos::Graph abstraction follows lazy execution, i.e., workloads added to a Kokkos::Graph are not executed until the whole graph is ready and submitted.
Typically, 3 phases are needed:
definition
instantiation
submission
The definition phase consists of describing the workloads: what they do, as well as their dependencies. In other words, this phase creates a topological graph of workloads.
The instantiation phase locks the topology, i.e., it cannot be changed anymore. During this phase, the graph will be checked for flaws. The backend creates an executable graph.
The last phase is submission. It will execute the workloads, observing their dependencies. This phase can be run multiple times.
16.1.3. Advantages

There are many advantages. Here are a few:
Since the workloads are described ahead of execution, the backend driver and/or compiler can leverage optimization opportunities.
Launch overhead is reduced, benefitting DAGs consisting of small workloads.
Some use cases might require adding nodes to a Kokkos::Graph with workloads that aren't expressed in terms of the Kokkos API but rather in native code, e.g., calling external math libraries like cuBLAS. Such scenarios arise in many situations, such as building and training a neural network, running a conjugate gradient method, and so on.
Capturing into a Kokkos::Graph boils down to writing the following snippet:

struct MyCudaCapture {
    ViewType data;
    void operator()(const Kokkos::Cuda& exec) const { ... }
};
...
auto my_captured_node = predecessor.cuda_capture(
    exec, MyCudaCapture{.data = my_data}
);
When the node is added to the Kokkos::Graph, the workloads are not directly dispatched to the device. Rather, the backend operations are “saved” for later “reuse” in the capture node.
Some important aspects of capture are worth pointing out:
The function object will be stored by the Kokkos::Graph instance, thereby ensuring that any data bound to the function object stays alive until the graph is destroyed.
The execution space instance exec associates the captured workloads with a device.
While in “capture mode”, backend-specific restrictions may apply (see the CUDA programming guide, for instance).
Warning
When a “stream” is used by multiple threads, capturing on one thread may affect other threads (search for cudaThreadExchangeStreamCaptureMode in the CUDA runtime API documentation, for instance).
For now, capture is only supported for the following backends:
Note
The SYCL documentation uses the term recording instead of capture, but it is essentially the same thing.
Consider a diamond-like DAG. The following snippet defines, instantiates, and submits a Kokkos::Graph for this DAG.
auto graph = Kokkos::create_graph([&](auto root) {
    auto node_A = root.then_parallel_for("workload A", ...policy..., ...functor...);
    auto node_B = node_A.then_parallel_for("workload B", ...policy..., ...functor...);
    auto node_C = node_A.then_parallel_for("workload C", ...policy..., ...functor...);
    auto node_D = Kokkos::when_all(node_B, node_C)
                      .then_parallel_for("workload D", ...policy..., ...functor...);
});
graph.instantiate();
graph.submit();

16.2.2. Capture of a cuBLAS call
This example shows how to create a node that captures a cuBLAS call. It also demonstrates how data is kept alive during the whole lifetime of the Kokkos::Graph (e.g. the cuBLAS handle).
#include "cublas_v2.h"

#include "Kokkos_Core.hpp"
#include "Kokkos_Graph.hpp"

#define ASSERT_EQ(a, b) \
    if (a != b) throw std::runtime_error("There was a problem: " #a " is not equal to " #b);

#define CHECK_CUBLAS_CALL(call)                               \
    {                                                         \
        const auto error_code = call;                         \
        if (error_code != CUBLAS_STATUS_SUCCESS)              \
        {                                                     \
            printf("%s:%d: failure of statement %s: %s (%d)\n", \
                   __FILE__, __LINE__,                        \
                   #call,                                     \
                   cublasGetStatusName(error_code), error_code); \
            std::abort();                                     \
        }                                                     \
    }

template <typename MatrixType, typename VectorType>
void initialize_system(const Kokkos::Cuda& exec, const MatrixType& matrix, const VectorType& vector)
{
    Kokkos::parallel_for(
        Kokkos::RangePolicy(exec, 0, matrix.extent(0)),
        KOKKOS_LAMBDA(const Kokkos::Cuda::size_type irow) {
            matrix(irow, 0) = 2 * irow + 1;
            matrix(irow, 1) = 2 * irow + 2;
            vector(irow)    = irow + 5;
        }
    );
}

auto create_cublas_handle()
{
    cublasHandle_t handle = nullptr;
    CHECK_CUBLAS_CALL(cublasCreate(&handle));
    return std::shared_ptr<cublasContext>(handle, [](cublasHandle_t ptr) {
        CHECK_CUBLAS_CALL(cublasDestroy(ptr));
    });
}

template <typename Exec, typename Predecessor, typename MatrixType, typename VectorType, typename ResultType>
auto gemv(const Exec& exec, const Predecessor& predecessor,
          const MatrixType& matrix, const VectorType& vector, const ResultType& result)
{
    static_assert(std::is_same_v<typename MatrixType::value_type, double>);
    static_assert(std::is_same_v<typename VectorType::value_type, double>);
    static_assert(std::is_same_v<typename ResultType::value_type, double>);

    auto handle = create_cublas_handle();

    const double alpha = 1., beta = 1.;

    /// The @c handle is a shared resource stored in the lambda.
    /// Since the lambda is stored by the node, and the node won't be destroyed
    /// until the @c Kokkos::Graph is destroyed, the @c cuBLAS handle is
    /// guaranteed to stay alive for the whole graph lifetime.
    return predecessor.cuda_capture(
        exec,
        [=](const Kokkos::Cuda& exec_) {
            CHECK_CUBLAS_CALL(cublasSetStream(handle.get(), exec_.cuda_stream()));
            CHECK_CUBLAS_CALL(cublasDgemv(
                handle.get(), CUBLAS_OP_N,
                vector.size(), result.size(),
                &alpha,
                matrix.data(), vector.size(),
                vector.data(), 1,
                &beta,
                result.data(), 1
            ));
        }
    );
}

void test_capture()
{
    constexpr size_t nrows = 2, ncols = 2;

    using value_t    = double;
    using matrix_t   = Kokkos::View<value_t[nrows][ncols], Kokkos::LayoutRight, Kokkos::CudaSpace>;
    using vector_d_t = Kokkos::View<value_t[nrows], Kokkos::CudaSpace>;
    using vector_s_t = Kokkos::View<value_t[nrows], Kokkos::SharedSpace>;

    const Kokkos::Cuda exec{};

    const matrix_t   matrix(Kokkos::view_alloc(Kokkos::WithoutInitializing, exec, "matrix"));
    const vector_d_t vector(Kokkos::view_alloc(Kokkos::WithoutInitializing, exec, "vector"));
    const vector_s_t result(Kokkos::view_alloc(exec, "result"));

    initialize_system(exec, matrix, vector);

    auto graph = Kokkos::Experimental::create_graph([&](const auto& root) {
        auto node_gemv = gemv(exec, root, matrix, vector, result);
    });
    graph.instantiate();

    //! The views are stored in the graph node. No kernel ran yet.
    ASSERT_EQ(matrix.use_count(), 2);
    ASSERT_EQ(vector.use_count(), 2);
    ASSERT_EQ(result.use_count(), 2);
    ASSERT_EQ(result(0), 0);
    ASSERT_EQ(result(1), 0);

    //! Let's submit the graph twice, to ensure that the captured node behaves well.
    graph.submit(exec);
    Kokkos::deep_copy(exec, vector, result);
    graph.submit(exec);

    exec.fence();

    ASSERT_EQ(result(0), 23 + 125);
    ASSERT_EQ(result(1), 34 + 182);
}

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    test_capture();
    Kokkos::finalize();
    return EXIT_SUCCESS;
}