Build the fastest OSS vLLM-based speculative decoding system for your own model, using ArcticTraining and ArcticInference!
We compare the throughput (tokens/s) of existing vLLM-based speculative decoding systems for Llama3.1-70B-Instruct on 8xH100 below:
Method                          ShareGPT   HumanEval
vLLM V1 Baseline                    84.1        84.1
vLLM V1 Eagle                      102.2       112.0
vLLM V1 Eagle3                      77.7        85.3
vLLM V0 MLP-Speculator (IBM)        77.9        66.7
ArcticSpeculator                   172.4       203.7

For more details about ArcticSpeculator and how to use it:
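To put the numbers in perspective, the relative speedups over the vLLM V1 baseline can be computed directly from the throughput figures above; the snippet below is a small sketch using the values copied from the table:

```python
# Throughput (tokens/s) per method, taken from the table above:
# (ShareGPT, HumanEval) on Llama3.1-70B-Instruct, 8xH100.
throughput = {
    "vLLM V1 Baseline": (84.1, 84.1),
    "vLLM V1 Eagle": (102.2, 112.0),
    "vLLM V1 Eagle3": (77.7, 85.3),
    "vLLM V0 MLP-Speculator (IBM)": (77.9, 66.7),
    "ArcticSpeculator": (172.4, 203.7),
}

base_sharegpt, base_humaneval = throughput["vLLM V1 Baseline"]

# Print each method's speedup relative to the baseline.
for method, (sharegpt, humaneval) in throughput.items():
    print(f"{method}: {sharegpt / base_sharegpt:.2f}x (ShareGPT), "
          f"{humaneval / base_humaneval:.2f}x (HumanEval)")
```

On these datasets ArcticSpeculator works out to roughly 2.05x (ShareGPT) and 2.42x (HumanEval) over the baseline.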
We also release ArcticSpeculator checkpoints, trained with ArcticTraining, that you can run with ArcticInference: