InfLLM V2 CUDA Kernel Implementation

English | 中文

This repository contains the optimized CUDA kernel implementation for InfLLM V2's Two-Stage Sparse Attention Mechanism. Our implementation provides high-performance kernels for both Stage 1 (Top-K Context Selection) and Stage 2 (Sparse Attention Computation), enabling Large Language Models (LLMs) to efficiently process long contexts with trainable sparse patterns.

InfLLM V2 introduces a novel two-stage approach for efficient long-context processing: the first stage selects the most relevant context blocks for each query token, and the second computes attention only over the selected blocks.

This repository provides optimized CUDA kernels for both stages.

Built upon FlashAttention, our kernels leverage efficient memory access patterns and optimized implementations for both stages.

Stage 1: Top-K Context Selection

The Top-K selection stage involves three sequential steps:

  1. Relevance Score Computation: Computing scores between query tokens and each semantic kernel (compressed representations of key-value blocks), followed by softmax normalization
  2. Score Aggregation: Aggregating relevance scores for each semantic kernel across the query group dimension using dimension reduction (hdim16_reduce)
  3. Block Selection (Post-processing): Selecting the top-K context blocks for each query token based on the aggregated scores

Note: The infllmv2_attn_stage1 kernel handles steps 1 and 2 (score computation and aggregation). Only step 3 (Top-K selection) is performed outside the kernel.
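For intuition, here is a minimal PyTorch sketch of what these three steps compute, written as a naive dense reference. The tensor shapes, the sum reduction standing in for hdim16_reduce, and the helper name are illustrative assumptions, not the kernel's exact semantics.

import torch
import torch.nn.functional as F

def topk_context_selection_reference(q, k_compressed, top_k):
    # Naive dense reference for Stage 1 (shapes and reduction are assumptions).
    #   q:            (n_groups, group_size, seqlen_q, head_dim)  query heads grouped per KV head
    #   k_compressed: (n_groups, num_blocks, head_dim)            one semantic kernel per KV block
    # Returns top-k block indices per query token: (n_groups, seqlen_q, top_k)
    scale = q.shape[-1] ** -0.5

    # Step 1: relevance scores between each query token and each semantic kernel,
    # normalized with softmax over the block dimension.
    scores = torch.einsum("ghqd,gkd->ghqk", q, k_compressed) * scale
    probs = F.softmax(scores, dim=-1)

    # Step 2: aggregate across the query-group dimension (a sum is assumed here,
    # standing in for the kernel's hdim16_reduce).
    aggregated = probs.sum(dim=1)                       # (n_groups, seqlen_q, num_blocks)

    # Step 3 (post-processing, outside the kernel): pick the top-k blocks per query token.
    topk_idx = aggregated.topk(top_k, dim=-1).indices   # (n_groups, seqlen_q, top_k)
    return topk_idx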

Stage 2: Sparse Attention Computation

The sparse attention stage performs standard attention computation, but only on the context blocks selected in Stage 1.
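
As a rough mental model (not the kernel's actual implementation), the per-query computation is equivalent to gathering the selected key/value blocks and running ordinary scaled-dot-product attention over just those tokens. The block size, single-head layout, and omission of causal masking below are simplifications for illustration.

import torch
import torch.nn.functional as F

def sparse_attention_reference(q, k, v, topk_idx, block_size=64):
    # Naive reference: attend only to the KV blocks selected for each query token.
    #   q:        (seqlen_q, head_dim)   single head, for clarity
    #   k, v:     (seqlen_k, head_dim)
    #   topk_idx: (seqlen_q, top_k)      block indices chosen in Stage 1
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for t in range(q.shape[0]):
        # Gather the tokens belonging to the blocks selected for this query.
        token_idx = torch.cat([
            torch.arange(b * block_size, min((b + 1) * block_size, k.shape[0]))
            for b in topk_idx[t].tolist()
        ])
        k_sel, v_sel = k[token_idx], v[token_idx]
        attn = F.softmax((q[t] @ k_sel.T) * scale, dim=-1)
        out[t] = attn @ v_sel
    return out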

Kernel Implementation Details

For Training (main branch)
# Clone the repository and use main branch for training
git clone https://github.com/OpenBMB/infllm_v2_cuda.git --recursive
cd infllm_v2_cuda
git checkout main

# Install with CUDA kernel compilation
pip install -e .

For Hugging Face Inference (feature_infer branch)
# Clone the repository and use feature_infer branch for inference
git clone https://github.com/OpenBMB/infllm_v2_cuda.git --recursive
cd infllm_v2_cuda
git checkout feature_infer

# Install with CUDA kernel compilation
pip install -e .
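
A quick way to confirm the extension compiled and loads correctly is to import the two entry points used in the API examples below (a minimal smoke test, not part of the official instructions):

# Verify that the compiled extension and its two-stage API entry points import.
from infllm_v2 import infllmv2_attn_stage1, infllmv2_sparse_attn_func

print("infllm_v2 import OK:", infllmv2_attn_stage1, infllmv2_sparse_attn_func)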

The InfLLM V2 CUDA kernel provides the following interfaces for the two-stage sparse attention:

Stage 1: Attention Score Computation and Aggregation (feature_infer branch)
from infllm_v2 import infllmv2_attn_stage1

# Stage 1: Compute and aggregate relevance scores between queries and semantic kernels
# This kernel performs:
#   1. LSE approximation using compressed keys
#   2. Full attention score computation
#   3. Score aggregation across query group dimension (hdim16_reduce)
# Top-K selection must be performed separately on the aggregated scores
#
# Inputs:
#   - q: Query tensor (batch_size * n_heads, seqlen_q, head_dim)
#   - k: Compressed key tensor representing semantic kernels
#   - v: Placeholder tensor (not used in score computation)
#   - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
#   - max_seqlen_q, max_seqlen_k: Maximum sequence lengths

# Returns aggregated attention scores for subsequent Top-K selection
aggregated_scores = infllmv2_attn_stage1(
    q, k, v,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_q,
    max_seqlen_k=max_seqlen_k,
    causal=True,  # Apply causal masking
    return_attn_probs=True  # Return attention scores
)

# Top-K selection should be performed on the returned aggregated scores
# (This step is not part of the kernel)
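
For example, the selection can be done with torch.topk on the returned scores. The layout of aggregated_scores (context blocks on the last dimension) and the value of top_k here are assumptions for illustration.

import torch

# Post-processing (outside the kernel): pick the top-k context blocks per query token.
top_k = 16  # example value
topk_idx = torch.topk(aggregated_scores, k=top_k, dim=-1).indices

# topk_idx is then passed to the Stage 2 sparse attention kernel below.
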
Stage 2: Sparse Attention Computation
from infllm_v2 import infllmv2_sparse_attn_func

# Stage 2: Sparse Attention Computation Kernel
# Inputs:
#   - q_unpad: Queries tensor (token-level)
#   - k_unpad, v_unpad: Keys and Values tensors (block-level)
#   - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
#   - topk_idx: Selected block indices from Stage 1
#   - max_seqlen_q, max_seqlen_k: Maximum sequence lengths
#   - block_window_size: Optional local attention window size

out_unpad = infllmv2_sparse_attn_func(
    q_unpad, k_unpad, v_unpad,
    cu_seqlens_q, cu_seqlens_k,
    topk_idx,  # Block indices selected in Stage 1
    max_seqlen_q, max_seqlen_k,
    block_window_size=0,  # Additional local window for attention
)
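
Both calls use the unpadded (varlen) layout, so cu_seqlens_q / cu_seqlens_k are cumulative token offsets with a leading zero, as in FlashAttention's varlen interface. Here is a sketch of building them from per-sequence lengths (the int32 dtype and the example lengths are assumptions):

import torch
import torch.nn.functional as F

# Cumulative sequence lengths for the unpadded (varlen) layout:
# a leading zero followed by the running sum of per-sequence token counts.
seqlens_q = torch.tensor([1024, 2048, 512], dtype=torch.int32)  # example lengths
cu_seqlens_q = F.pad(torch.cumsum(seqlens_q, dim=0, dtype=torch.int32), (1, 0))
# cu_seqlens_q -> tensor([   0, 1024, 3072, 3584], dtype=torch.int32)
max_seqlen_q = int(seqlens_q.max())
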
Performance Considerations

Supported GPU Architectures

Performance Comparison: InfLLMv2 vs FlashAttention

All benchmarks were conducted with the following configuration:

Detailed Performance Results

| Sequence Length | Batch Size | Implementation | Forward (ms) | Backward (ms) | Combined (ms) | Speedup vs FlashAttention |
|---|---|---|---|---|---|---|
| 32,768 | 8 | Flash Attention | 201.46 | 526.62 | 728.08 | 1x |
| 32,768 | 8 | Triton NSA | 169.11 | 343.82 | 512.93 | 1.42x |
| 32,768 | 8 | InfLLMv2 | 133.60 | 330.04 | 463.64 | 1.57x |
| 65,536 | 4 | Flash Attention | 409.29 | 1037.46 | 1446.75 | 1x |
| 65,536 | 4 | Triton NSA | 181.88 | 469.00 | 650.88 | 2.22x |
| 65,536 | 4 | InfLLMv2 | 142.31 | 381.55 | 523.86 | 2.76x |
| 131,072 | 2 | Flash Attention | 831.77 | 2063.11 | 2894.88 | 1x |
| 131,072 | 2 | Triton NSA | 216.10 | 589.66 | 805.76 | 3.59x |
| 131,072 | 2 | InfLLMv2 | 158.42 | 468.90 | 627.32 | 4.61x |

If you use the InfLLM V2 CUDA kernels in your research, please cite:

@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM},
  year={2025}
}
