When I conducted a stability stress test with 400 concurrent requests of varying lengths, the unbootstrap queue in the prefill stage kept piling up. Additionally, the token_usage on the decode side approached 1. Due to the timeout mechanism in our link (e.g., automatically disconnecting when TTFT exceeds 60s), the business layer actively terminates the connection with the sglang HTTP server. However, it appears that the decode side fails to release resources properly, leading to the accumulation of unbootstrap tasks in the prefill stage and eventually causing the decode side to hang.
Decode Prefill ReproductionLook Above
EnvironmentH20
Prefill 2-node
Decode 4-node
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4