cache-dit
You can install the stable release of cache-dit from PyPI:
pip3 install -U cache-dit
Or you can install the latest development version from GitHub:
pip3 install git+https://github.com/vipshop/cache-dit.git
DBCache: Dual Block Caching for Diffusion Transformers. We have enhanced FBCache into a more general and customizable cache algorithm, namely DBCache, enabling fully UNet-style cache acceleration for DiT models. Different configurations of compute blocks (F8B12, etc.) can be customized in DBCache. Moreover, it is entirely training-free. DBCache strikes a strong balance between performance and precision.
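To make the FnBn notation concrete, the following is a minimal conceptual sketch of one cached inference step (the helper name fnbn_cached_step and the cache keys are illustrative, not the actual cache-dit internals): the first Fn blocks are always computed and used to measure how much the features have changed since the previous step, the middle blocks are skipped by reusing a cached residual when that change is small, and the last Bn blocks are always computed to refine the approximated output.

def fnbn_cached_step(blocks, x, cache, Fn=8, Bn=0, threshold=0.12):
    # Conceptual sketch only -- not the real cache-dit implementation.
    # 1) Always compute the first Fn blocks; their output is used to decide
    #    whether the middle blocks can be skipped at this step.
    for block in blocks[:Fn]:
        x = block(x)
    prev = cache.get("Fn_states")
    diff = float("inf") if prev is None else float((x - prev).abs().mean() / prev.abs().mean())
    cache["Fn_states"] = x
    if diff > threshold or "mid_residual" not in cache:
        # 2a) Cache miss: run the middle blocks and refresh the cached residual.
        mid_in = x
        for block in blocks[Fn:len(blocks) - Bn]:
            x = block(x)
        cache["mid_residual"] = x - mid_in
    else:
        # 2b) Cache hit: approximate the middle blocks with the cached residual.
        x = x + cache["mid_residual"]
    # 3) Always compute the last Bn blocks to refine the approximation.
    for block in blocks[len(blocks) - Bn:]:
        x = block(x)
    return x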
DBCache, L20x1, Steps: 28, "A cat holding a sign that says hello world with complex background"
DBCache, L20x4, Steps: 20, a case showing the texture recovery ability of DBCache
These case studies demonstrate that even with relatively high thresholds (such as 0.12, 0.15, 0.2, etc.) under the DBCache F12B12 or F8B16 configuration, the detailed texture of the kitten's fur, colored cloth, and the clarity of text can still be preserved. This suggests that users can leverage DBCache to effectively balance performance and precision in their workflows!
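For instance, assuming the options documented in the examples below, a higher-precision F8B16 configuration with a relaxed threshold (the values here are illustrative, mirroring the cases above) might look like:

# Illustrative F8B16 configuration with a relaxed threshold.
cache_options = {
    "cache_type": cache_dit.DBCache,
    "warmup_steps": 8,
    "max_cached_steps": -1,        # -1 means no limit
    "Fn_compute_blocks": 8,        # F8
    "Bn_compute_blocks": 16,       # B16
    "residual_diff_threshold": 0.15,
}
cache_dit.enable_cache(pipe, **cache_options)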
DBCache provides configurable parameters for custom optimization, enabling a balanced trade-off between performance and precision:
For a good balance between performance and precision, DBCache is configured by default with F8B0, 8 warmup steps, and unlimited cached steps.
import torch
import cache_dit
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Default options, F8B0, good balance between performance and precision
cache_options = cache_dit.default_options(cache_dit.DBCache)

# Custom options, F8B8, higher precision
cache_options = {
    "cache_type": cache_dit.DBCache,
    "warmup_steps": 8,
    "max_cached_steps": -1,   # -1 means no limit
    "Fn_compute_blocks": 8,   # Fn, F8, etc.
    "Bn_compute_blocks": 8,   # Bn, B8, etc.
    "residual_diff_threshold": 0.12,
}

cache_dit.enable_cache(pipe, **cache_options)
Moreover, users who configure higher Bn values (e.g., F8B16) while aiming to maintain good performance can specify Bn_compute_blocks_ids to work in conjunction with Bn. DBCache will only compute the specified blocks, with the remaining blocks estimated from the previous step's residual cache.
# Custom options, F8B16, higher precision with good performance.
cache_options = {
    # e.g., 0, 2, 4, ..., 14 from the range [0, 16)
    "Bn_compute_blocks_ids": cache_dit.block_range(0, 16, 2),
    # If the L1 difference is below this threshold, skip the Bn blocks
    # not in `Bn_compute_blocks_ids` (1, 3, ..., etc.); otherwise,
    # compute these blocks.
    "non_compute_blocks_diff_threshold": 0.08,
}
We also support the TaylorSeer algorithm (From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers) to further improve the precision of DBCache when the number of cached steps is large, namely Hybrid TaylorSeer + DBCache. Across timesteps with large intervals, the feature similarity in diffusion models decreases substantially, which significantly harms generation quality.
$$\mathcal{F}_{\text{pred},m}\left(x_{t-k}^l\right)=\mathcal{F}\left(x_t^l\right)+\sum_{i=1}^m \frac{\Delta^i \mathcal{F}\left(x_t^l\right)}{i! \cdot N^i}(-k)^i$$
TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predicts features at future timesteps via Taylor series expansion. The TaylorSeer implemented in CacheDiT supports both hidden states and residual cache types; that is, $\mathcal{F}_{\text{pred},m}\left(x_{t-k}^l\right)$ can be a residual cache or a hidden-state cache.
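As a rough sketch of the idea (assuming the cached features are torch tensors and using simple backward finite differences; the helper name taylor_predict is illustrative, not the cache-dit API), a TaylorSeer-style predictor of the feature k steps ahead can be written as:

import math

def taylor_predict(history, k, N=1, m=2):
    # history: previously computed features (torch tensors), most recent last,
    # spaced N timesteps apart; history[-1] corresponds to F(x_t).
    # Returns the m-th order Taylor prediction of the feature k steps ahead.
    pred = history[-1].clone()
    for i in range(1, min(m, len(history) - 1) + 1):
        # i-th order backward finite difference, approximating Delta^i F(x_t).
        diff = sum((-1) ** j * math.comb(i, j) * history[-1 - j] for j in range(i + 1))
        pred = pred + diff * ((-k) ** i) / (math.factorial(i) * N ** i)
    return pred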
cache_options = {
    # TaylorSeer options
    "enable_taylorseer": True,
    "enable_encoder_taylorseer": True,
    # TaylorSeer cache type can be hidden_states or residual.
    "taylorseer_cache_type": "residual",
    # Higher values of n_derivatives will lead to longer
    # computation time but may improve precision significantly.
    "taylorseer_kwargs": {
        "n_derivatives": 2,  # default is 2.
    },
    "warmup_steps": 3,  # prefer: >= n_derivatives + 1
    "residual_diff_threshold": 0.12,
}
Important
Please note that if you use TaylorSeer as the calibrator for approximate hidden states, the Bn param of DBCache can be set to 0. In essence, DBCache's Bn also acts as a calibrator, so you can choose either Bn > 0 or TaylorSeer. We recommend the TaylorSeer + DBCache FnB0 configuration scheme.
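For example, a combined Hybrid TaylorSeer + DBCache F1B0 setup (matching the case below; the values are illustrative and reuse only the options shown above) could be:

# Hybrid TaylorSeer + DBCache F1B0: Bn = 0, TaylorSeer acts as the calibrator.
cache_options = {
    "cache_type": cache_dit.DBCache,
    "Fn_compute_blocks": 1,   # F1
    "Bn_compute_blocks": 0,   # B0
    "warmup_steps": 3,        # prefer: >= n_derivatives + 1
    "max_cached_steps": -1,
    "residual_diff_threshold": 0.12,
    "enable_taylorseer": True,
    "enable_encoder_taylorseer": True,
    "taylorseer_cache_type": "residual",
    "taylorseer_kwargs": {"n_derivatives": 2},
}
cache_dit.enable_cache(pipe, **cache_options)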
DBCache F1B0 + TaylorSeer, L20x1, Steps: 28, "A cat holding a sign that says hello world with complex background"
CacheDiT supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not use CFG at all, please set the do_separate_classifier_free_guidance param to False (default). Otherwise, set it to True. For example:
cache_options = {
    # CFG: classifier-free guidance or not.
    # For models that fuse CFG and non-CFG into a single forward step,
    # set do_separate_classifier_free_guidance to False.
    # For example, set it to True for Wan 2.1 and to False
    # for FLUX.1, HunyuanVideo, CogVideoX, Mochi.
    "do_separate_classifier_free_guidance": True,  # Wan 2.1, Qwen-Image
    # Compute the CFG forward first or not, default False, namely,
    # 0, 2, 4, ... -> non-CFG step; 1, 3, 5, ... -> CFG step.
    "cfg_compute_first": False,
    # Compute separate diff values for the CFG and non-CFG steps,
    # default True. If False, we will reuse the diff computed in the
    # current non-CFG transformer step for the current CFG step.
    "cfg_diff_compute_separate": True,
}

⚡️DBPrune: Dynamic Block Prune
We have further implemented a new Dynamic Block Prune algorithm based on Residual Caching for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then dynamically prunes blocks during inference by computing the L1 distance between the current and previous steps' hidden states. When a block is pruned, its output is approximated using the cached residuals.
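Conceptually, the per-block decision looks roughly like the following sketch (the helper name maybe_prune_block and the cache layout are hypothetical, not the actual cache-dit internals):

def maybe_prune_block(block, x, cache, name, threshold=0.05):
    # Simplified per-block DBPrune-style decision for one inference step.
    prev_x = cache.get((name, "hidden_states"))
    cache[(name, "hidden_states")] = x
    if prev_x is not None and (name, "residual") in cache:
        # Relative L1 distance between this step's and the previous step's inputs.
        diff = float((x - prev_x).abs().mean() / prev_x.abs().mean())
        if diff < threshold:
            # Prune: approximate the block output with the cached residual.
            return x + cache[(name, "residual")]
    # Otherwise compute the block and refresh its cached residual.
    out = block(x)
    cache[(name, "residual")] = out - x
    return out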
import torch
import cache_dit
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Using DBPrune with default options
cache_dit.enable_cache(
    pipe,
    **cache_dit.default_options(cache_dit.DBPrune)
)
We have also brought the designs from DBCache to DBPrune to make it a more general and customizable block prune algorithm. You can specify the values of Fn and Bn for higher precision, or set up the non-prune blocks list non_prune_blocks_ids to avoid aggressive pruning. For example:
# Custom options for DBPrune
cache_options = {
    "cache_type": cache_dit.DBPrune,
    "residual_diff_threshold": 0.05,
    # Never prune the first `Fn` and last `Bn` blocks.
    "Fn_compute_blocks": 8,  # default 1
    "Bn_compute_blocks": 8,  # default 0
    "warmup_steps": 8,       # default -1
    # Disable the pruning strategy when the number of previously
    # pruned steps is greater than this value.
    "max_pruned_steps": 12,  # default -1, means no limit
    # Enable a dynamic prune threshold within each step; a higher
    # `max_dynamic_prune_threshold` value may introduce a more
    # aggressive pruning strategy.
    "enable_dynamic_prune_threshold": True,
    "max_dynamic_prune_threshold": 2 * 0.05,
    # (New thresh) = mean(previous_block_diffs_within_step) * 1.25
    # (New thresh) = ((New thresh) if (New thresh) <
    # max_dynamic_prune_threshold else residual_diff_threshold)
    "dynamic_prune_threshold_relax_ratio": 1.25,
    # The step interval at which to update the residual cache. For example,
    # 2 means the update steps will be [0, 2, 4, ...].
    "residual_cache_update_interval": 1,
    # You can set non-prune blocks to avoid aggressive pruning.
    # For example, FLUX.1 has 19 + 38 blocks, so we can set it
    # to 0, 2, 4, ..., 56, etc.
    "non_prune_blocks_ids": [],
}

cache_dit.enable_cache(pipe, **cache_options)
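To make the dynamic-threshold comments above concrete, the rule can be summarized with this small sketch (the helper name is hypothetical; the parameter names mirror the options above):

def dynamic_prune_threshold(prev_block_diffs, residual_diff_threshold=0.05,
                            relax_ratio=1.25, max_dynamic_prune_threshold=0.10):
    # Sketch of the dynamic prune threshold rule described in the comments above.
    if not prev_block_diffs:
        return residual_diff_threshold
    new_threshold = sum(prev_block_diffs) / len(prev_block_diffs) * relax_ratio
    # Fall back to the static threshold if the relaxed value grows too large.
    return new_threshold if new_threshold < max_dynamic_prune_threshold else residual_diff_threshold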
Important
Please note that for GPUs with lower VRAM, DBPrune may not be suitable for use on video DiTs, as it caches the hidden states and residuals of each block, leading to higher GPU memory requirements. In such cases, please use DBCache, which only caches the hidden states and residuals of 2 blocks.
DBPrune, L20x1, Steps: 28, "A cat holding a sign that says hello world with complex background"
Currently, for any diffusion model whose transformer blocks match the following input/output pattern, you can use the Unified Cache APIs from cache-dit. Please refer to run_qwen_image_uapi.py as an example.
(IN: hidden_states, encoder_hidden_states) -> (OUT: hidden_states, encoder_hidden_states)
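That is, a transformer block compatible with the Unified Cache APIs takes and returns the pair (hidden_states, encoder_hidden_states), roughly like this hypothetical block (for illustration only):

import torch

class MyTransformerBlock(torch.nn.Module):
    # Hypothetical block whose forward matches the pattern above.
    def __init__(self, dim: int = 64):
        super().__init__()
        self.img_mlp = torch.nn.Linear(dim, dim)
        self.txt_mlp = torch.nn.Linear(dim, dim)

    def forward(self, hidden_states, encoder_hidden_states):
        hidden_states = hidden_states + self.img_mlp(hidden_states)
        encoder_hidden_states = encoder_hidden_states + self.txt_mlp(encoder_hidden_states)
        return hidden_states, encoder_hidden_states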
The Unified Cache APIs are currently experimental; please stay tuned for updates.
import torch
import cache_dit
from diffusers import QwenImagePipeline

# Can be [Any] Diffusion Pipeline
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
)

cache_dit.enable_cache(
    pipe,
    transformer=pipe.transformer,
    blocks=pipe.transformer.transformer_blocks,
    return_hidden_states_first=False,
    **cache_options,
)
By the way, CacheDiT is designed to be compatible with torch.compile. You can easily combine CacheDiT with torch.compile to achieve even better performance. For example:
cache_dit.enable_cache(
    pipe,
    **cache_dit.default_options(cache_dit.DBPrune)
)

# Compile the Transformer module
pipe.transformer = torch.compile(pipe.transformer)
However, users intending to use CacheDiT for DiT models with dynamic input shapes should consider increasing the recompile limit of torch._dynamo. Otherwise, the recompile_limit error may be triggered, causing the module to fall back to eager mode.
torch._dynamo.config.recompile_limit = 96  # default is 8
torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
Please check bench.py for more details.
You can utilize the APIs provided by CacheDiT to quickly evaluate the accuracy losses caused by different cache configurations. For example:
from cache_dit.metrics import compute_psnr
from cache_dit.metrics import compute_video_psnr
from cache_dit.metrics import FrechetInceptionDistance  # FID

FID = FrechetInceptionDistance()
image_psnr, n = compute_psnr("true.png", "test.png")  # Num: n
image_fid, n = FID.compute_fid("true_dir", "test_dir")
video_psnr, n = compute_video_psnr("true.mp4", "test.mp4")  # Frames: n
Please check test_metrics.py for more details. Alternatively, you can use the cache-dit-metrics-cli tool. For example:
cache-dit-metrics-cli -h  # show usage
# all: PSNR, FID, SSIM, MSE, ..., etc.
cache-dit-metrics-cli all  -i1 true.png -i2 test.png  # image
cache-dit-metrics-cli all  -i1 true_dir -i2 test_dir  # image dir
cache-dit-metrics-cli all  -v1 true.mp4 -v2 test.mp4  # video
cache-dit-metrics-cli all  -v1 true_dir -v2 test_dir  # video dir
cache-dit-metrics-cli fid  -i1 true_dir -i2 test_dir  # FID
cache-dit-metrics-cli psnr -i1 true_dir -i2 test_dir  # PSNR
How to contribute? Star ⭐️ this repo to support us or check CONTRIBUTE.md.
The CacheDiT codebase is adapted from FBCache. Special thanks to their excellent work! We have followed the original License from FBCache, please check LICENSE for more details.
@misc{CacheDiT@2025,
  title={CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers},
  url={https://github.com/vipshop/cache-dit.git},
  note={Open-source software available at https://github.com/vipshop/cache-dit.git},
  author={vipshop.com},
  year={2025}
}