To keep GPU memory usage to a minimum, we currently use communication queues for the NVLink and RDMA buffers. This means tokens cyclically reuse a small buffer: when the queue is full, no new tokens are transmitted, and transfers resume only once the queue has free space.
However, this approach has drawbacks: it can potentially cause deadlocks, repeatedly polling the queue adds latency, and reaching peak performance requires a complex implementation. You can see this reflected in our internode code, where adding new features comes at a significant cost.
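To make the drawbacks concrete, here is a minimal single-threaded sketch of such a cyclic queue (the struct, names, and sizes are illustrative assumptions, not our actual NVLink/RDMA code): the sender has to poll for a free slot before every transfer and stalls whenever the queue is full.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Conceptual, single-threaded model of the cyclic communication queue described
// above. The real NVLink/RDMA buffers involve GPU-side flags and atomics; this
// sketch only shows the control flow behind the drawbacks: the sender must poll
// for space, and a full queue blocks all new tokens.
struct TokenQueue {
    std::vector<float> slots;  // small reusable buffer: capacity * hidden_size floats
    std::size_t capacity;      // number of token slots, far fewer than total tokens
    std::size_t hidden_size;
    std::size_t head = 0;      // next slot the receiver will consume
    std::size_t tail = 0;      // next slot the sender will fill
    std::size_t in_flight = 0;

    TokenQueue(std::size_t cap, std::size_t hidden)
        : slots(cap * hidden), capacity(cap), hidden_size(hidden) {}

    // Sender side: returns false when the queue is full, so the caller must keep
    // polling until the receiver frees a slot; this polling is the source of the
    // extra latency (and, with multiple ranks waiting on each other, deadlocks).
    bool try_send(const float* token) {
        if (in_flight == capacity)
            return false;  // queue full: the token cannot be transmitted yet
        float* dst = slots.data() + tail * hidden_size;
        std::copy(token, token + hidden_size, dst);
        tail = (tail + 1) % capacity;  // cyclic reuse of the small buffer
        ++in_flight;
        return true;
    }

    // Receiver side: consuming a token frees its slot for reuse by the sender.
    bool try_receive(float* out) {
        if (in_flight == 0)
            return false;
        const float* src = slots.data() + head * hidden_size;
        std::copy(src, src + hidden_size, out);
        head = (head + 1) % capacity;
        --in_flight;
        return true;
    }
};
```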
If you're referencing our code but want to design your own implementation, we also suggest a simpler overall design for your consideration:
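A minimal sketch of one way such a design could look, assuming a flat worst-case layout (the names and numbers below are illustrative, not taken from the actual code): size each receive buffer for the maximum number of tokens it may ever hold, so every token gets a fixed slot and the sender never needs to check for free space.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Queue-free alternative: size each communication buffer for the worst case so
// that every incoming token has a fixed, precomputed slot. There is no polling
// and no flow control, at the cost of a larger GPU memory footprint.
struct FlatRecvBuffer {
    std::size_t num_src_ranks;
    std::size_t max_tokens_per_rank;  // worst-case tokens a single rank may send here
    std::size_t hidden_size;
    std::vector<float> data;          // num_src_ranks * max_tokens_per_rank * hidden_size

    FlatRecvBuffer(std::size_t ranks, std::size_t max_tokens, std::size_t hidden)
        : num_src_ranks(ranks), max_tokens_per_rank(max_tokens), hidden_size(hidden),
          data(ranks * max_tokens * hidden) {}

    // A token from (src_rank, token_idx) always lands in the same place, so the
    // sender can write immediately without querying the receiver's queue state.
    float* slot(std::size_t src_rank, std::size_t token_idx) {
        return data.data() + (src_rank * max_tokens_per_rank + token_idx) * hidden_size;
    }

    // Memory cost of the worst-case sizing, in bytes (float elements assumed).
    std::size_t bytes() const { return data.size() * sizeof(float); }
};

int main() {
    // Illustrative numbers only: 8 source ranks, up to 4096 tokens each,
    // hidden size 7168; this works out to 896 MB for one receive buffer.
    FlatRecvBuffer buf(8, 4096, 7168);
    std::printf("worst-case buffer size: %.1f MB\n",
                buf.bytes() / (1024.0 * 1024.0));
    return 0;
}
```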
Overall, this approach might use more GPU memory (the exact amount depends on the specific scenario), but the implementation would be much simpler. You could more easily add new features, and the performance ceiling might be slightly better.
Thanks to @KnowingNothing from ByteDance for discussing and suggesting this approach!