To keep GPU memory usage to a minimum, we currently use communication queues for the NVLink and RDMA buffers. This means tokens cyclically reuse a small buffer: when the queue is full, no new tokens are transmitted, and transfers resume only once the queue has free space.
However, this approach has drawbacks: it can potentially cause deadlocks, repeatedly polling the queue adds latency, and reaching peak performance requires a complex implementation. You can see this reflected in our internode code, where adding new features comes at a significant cost.
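To make the drawbacks concrete, here is a minimal single-threaded sketch of such a cyclic queue (the struct, names, and sizes are illustrative assumptions, not our actual NVLink/RDMA code): the sender has to poll for a free slot before every transfer and stalls whenever the queue is full.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Conceptual, single-threaded model of the cyclic communication queue described
// above. The real NVLink/RDMA buffers involve GPU-side flags and atomics; this
// sketch only shows the control flow behind the drawbacks: the sender must poll
// for space, and a full queue blocks all new tokens.
struct TokenQueue {
    std::vector<float> slots;  // small reusable buffer: capacity * hidden_size floats
    std::size_t capacity;      // number of token slots, far fewer than total tokens
    std::size_t hidden_size;
    std::size_t head = 0;      // next slot the receiver will consume
    std::size_t tail = 0;      // next slot the sender will fill
    std::size_t in_flight = 0;

    TokenQueue(std::size_t cap, std::size_t hidden)
        : slots(cap * hidden), capacity(cap), hidden_size(hidden) {}

    // Sender side: returns false when the queue is full, so the caller must keep
    // polling until the receiver frees a slot; this polling is the source of the
    // extra latency (and, with multiple ranks waiting on each other, deadlocks).
    bool try_send(const float* token) {
        if (in_flight == capacity)
            return false;  // queue full: the token cannot be transmitted yet
        float* dst = slots.data() + tail * hidden_size;
        std::copy(token, token + hidden_size, dst);
        tail = (tail + 1) % capacity;  // cyclic reuse of the small buffer
        ++in_flight;
        return true;
    }

    // Receiver side: consuming a token frees its slot for reuse by the sender.
    bool try_receive(float* out) {
        if (in_flight == 0)
            return false;
        const float* src = slots.data() + head * hidden_size;
        std::copy(src, src + hidden_size, out);
        head = (head + 1) % capacity;
        --in_flight;
        return true;
    }
};
```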
If you're referencing our code but want to design your own implementation, we also suggest a simpler overall design for your consideration:
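A minimal sketch of one way such a design could look, assuming a flat worst-case layout (the names and numbers below are illustrative, not taken from the actual code): size each receive buffer for the maximum number of tokens it may ever hold, so every token gets a fixed slot and the sender never needs to check for free space.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Queue-free alternative: size each communication buffer for the worst case so
// that every incoming token has a fixed, precomputed slot. There is no polling
// and no flow control, at the cost of a larger GPU memory footprint.
struct FlatRecvBuffer {
    std::size_t num_src_ranks;
    std::size_t max_tokens_per_rank;  // worst-case tokens a single rank may send here
    std::size_t hidden_size;
    std::vector<float> data;          // num_src_ranks * max_tokens_per_rank * hidden_size

    FlatRecvBuffer(std::size_t ranks, std::size_t max_tokens, std::size_t hidden)
        : num_src_ranks(ranks), max_tokens_per_rank(max_tokens), hidden_size(hidden),
          data(ranks * max_tokens * hidden) {}

    // A token from (src_rank, token_idx) always lands in the same place, so the
    // sender can write immediately without querying the receiver's queue state.
    float* slot(std::size_t src_rank, std::size_t token_idx) {
        return data.data() + (src_rank * max_tokens_per_rank + token_idx) * hidden_size;
    }

    // Memory cost of the worst-case sizing, in bytes (float elements assumed).
    std::size_t bytes() const { return data.size() * sizeof(float); }
};

int main() {
    // Illustrative numbers only: 8 source ranks, up to 4096 tokens each,
    // hidden size 7168; this works out to 896 MB for one receive buffer.
    FlatRecvBuffer buf(8, 4096, 7168);
    std::printf("worst-case buffer size: %.1f MB\n",
                buf.bytes() / (1024.0 * 1024.0));
    return 0;
}
```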
Overall, this approach might use more GPU memory (the exact amount depends on the specific scenario), but the implementation would be much simpler. You could more easily add new features, and the performance ceiling might be slightly better.
Thanks to @KnowingNothing from ByteDance for discussing and suggesting this approach!