A Dataflow library for graph analytics acceleration
Enumeration of GraphOps Components
Data blocks are the primary GraphOps components. They handle incoming data streams, perform arithmetic operations, and route outputs to memory or subsequent blocks:
In GraphOps, the majority of the logic is amenable to dataflow; one key reason is that feedback control is rare. There are situations, however, that call for more intricate control that is difficult to express without state machines. The control blocks are embedded inside the data blocks and are responsible for handling these situations. They are as follows:
Additional logic is needed to interface properly with the memory system and the host platform. This logic is provided by the utility blocks:
In the GraphOps library, as with any streaming system, proper matching of throughput rates and buffering are critical to achieving correct execution. For example, the metadata packets in NbrPropRdr arrive at a much higher rate than the data packets. The input buffers shown in the figure are sized to account for this. In addition, the memory-requesting blocks reduce their output rate using throttling. In the ForAllPropRdr block, for instance, the memory request and metadata emission rates are reduced by a factor of two. The throttling ratio is easily modified using a single parameter, local to the requesting block. This design allows NbrPropRed to better match the memory throughput and prevent overflow in its input buffers.
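As a rough illustration of the throttling described above, the following Python sketch models a requesting block that emits at most one memory request every `throttle` cycles. The class and signal names are illustrative assumptions, not the GraphOps source.

```python
# Behavioral sketch only -- names and structure are illustrative assumptions,
# not the GraphOps implementation.

class ThrottledRequester:
    """Emits at most one memory request every `throttle` cycles."""

    def __init__(self, throttle=2):
        self.throttle = throttle  # single local parameter (e.g., 2, as in ForAllPropRdr)
        self.counter = 0

    def tick(self, request_pending):
        """Called once per cycle; returns a request token or None."""
        self.counter = (self.counter + 1) % self.throttle
        if request_pending and self.counter == 0:
            return "mem_request"  # placeholder for an address/metadata packet
        return None
```

With `throttle=2`, a request can be issued at most every other cycle, halving the pressure on downstream input buffers.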
Despite best efforts to match throughputs across different blocks, it is difficult to account for all corner cases of execution. For example, in a highly irregular graph, NbrPropRed may spend a disproportionate amount of time reducing one particularly large neighbor set. Meanwhile, the edge list pointer buffers may overflow. To address this potential issue, we simply generate a stall signal which is emitted "upstream" towards the preceding block. Each block can propagate a downstream stall request by OR'ing the incoming stall signal with its own locally generated stall before emitting the result on its Stall output.
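A minimal sketch of this stall propagation, assuming a simple occupancy threshold as the locally generated stall condition (the threshold value here is an assumption for illustration):

```python
# Behavioral sketch only: OR the stall coming from downstream with the
# locally generated stall before passing the result upstream.

BUFFER_STALL_THRESHOLD = 48  # assumed value; real blocks size this per buffer

def stall_output(downstream_stall: bool, buffer_occupancy: int) -> bool:
    """Value driven on a block's Stall output toward the preceding block."""
    local_stall = buffer_occupancy >= BUFFER_STALL_THRESHOLD
    return downstream_stall or local_stall
```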
Halting Execution and Host Interaction
There needs to be a simple and extensible mechanism for determining when the hardware has completed execution. To address this issue, each block maintains its own notion of when it has finished. NbrPropRed, for example, is done when its edge list input buffers have been cleared after a period of execution. The EndSignal block collects all of the done signals and sends an interrupt to the host system when every done signal is stable.
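The aggregation performed by EndSignal could be modelled roughly as below; the stability window is an assumed parameter, shown only to illustrate waiting for the done signals to settle.

```python
# Behavioral sketch only: raise the host interrupt once every per-block done
# signal has been asserted and has remained stable for several cycles.

class EndSignal:
    def __init__(self, num_blocks, stable_cycles=16):
        self.num_blocks = num_blocks
        self.stable_cycles = stable_cycles  # assumed stability window
        self.stable_count = 0

    def tick(self, done_signals):
        """Called once per cycle with the current done bits; returns True
        when the interrupt to the host should be issued."""
        if len(done_signals) == self.num_blocks and all(done_signals):
            self.stable_count += 1
        else:
            self.stable_count = 0  # any deasserted done restarts the window
        return self.stable_count >= self.stable_cycles
```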
FifoKernelSM Parallel Queue
A parallel queue can be constructed using the FifoKernelSM control block. This structure addresses a situation commonly encountered in the GraphOps parallel processing framework. As described in the NbrPropRed architecture, incoming memory data packets carry multiple potential elements, and dynamic masks define which of them are buffered and which are discarded. In this scenario, a control block is needed which allows round-robin selection from among only those buffers which have data. When a buffer is selected, a dequeue signal must be automatically sent to only that buffer, leaving the other queues intact. This tight feedback operation necessitates special control logic.
The FifoKernelSM enables this automated parallel queue by providing a dataReady control output, in addition to the standard FIFO interface signals provided by the underlying FIFO primitive (e.g., a Xilinx FIFO block). The dataReady vector feeds a combinational logic block which selects one entry from among the buffers with data available, producing a one-hot readEnable vector. This vector is fed back into the buffers as the dequeue input.
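The following Python sketch mimics that feedback loop in software: the dataReady vector is derived from buffer occupancy, a round-robin arbiter picks one non-empty buffer, and only that buffer is dequeued (the one-hot readEnable feedback). It illustrates the mechanism; it is not the FifoKernelSM source.

```python
# Behavioral sketch only: round-robin selection among non-empty buffers with
# a one-hot dequeue, mirroring the dataReady/readEnable feedback described above.

from collections import deque

class ParallelQueue:
    def __init__(self, num_fifos):
        self.fifos = [deque() for _ in range(num_fifos)]
        self.last = num_fifos - 1  # round-robin pointer

    def enqueue(self, idx, element):
        self.fifos[idx].append(element)  # masked-off elements are simply never enqueued

    def tick(self):
        """One cycle: select the next FIFO with data after `last`, dequeue
        from that FIFO only, and return the element (or None if all are empty)."""
        data_ready = [len(f) > 0 for f in self.fifos]  # dataReady vector
        n = len(self.fifos)
        for offset in range(1, n + 1):
            idx = (self.last + offset) % n
            if data_ready[idx]:
                self.last = idx
                return self.fifos[idx].popleft()  # one-hot readEnable: only this FIFO
        return None
```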
Further Notes on Applications
The number of MemUnit blocks used by each accelerator corresponds to the number of unique memory request interfaces in the system. This includes one interface used to issue the end-of-execution interrupt. As described earlier, the memory controller in the underlying hardware system offers a single channel, shared by all request queues via multiplexing. There are no ordering guarantees among the request queues.
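As a rough sketch of how several MemUnit request queues might share the single memory channel (the round-robin policy here is an assumption; the text only states that the channel is multiplexed with no ordering between queues):

```python
# Behavioral sketch only: several request queues multiplexed onto one memory
# channel; requests from different queues may interleave in any order.

from collections import deque

class MemChannelMux:
    def __init__(self, num_units):
        self.queues = [deque() for _ in range(num_units)]  # one queue per MemUnit
        self.next = 0

    def submit(self, unit, request):
        self.queues[unit].append(request)

    def issue(self):
        """Issue at most one request per cycle from some non-empty queue."""
        for _ in range(len(self.queues)):
            q = self.queues[self.next]
            self.next = (self.next + 1) % len(self.queues)
            if q:
                return q.popleft()
        return None
```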
From the lists of blocks used, it is evident that several blocks appear in multiple accelerators. This supports the observation that common computational paradigms exist across an interesting set of graph analytics algorithms. The GraphOps library makes building accelerators for these types of algorithms more accessible.
GraphOps is licensed under the MIT License.