Example 9.2: Maximizing Throughput

We now turn to the opposite extreme and attempt to maximize throughput without any consideration for latency. While the absolute-minimum-latency configuration rarely makes sense, there are many applications for which latency can be arbitrarily high. If you are batch-processing a large volume of data without real-time client interaction, throughput is likely to be the only relevant metric.

$\color{#76b900}{\text{Step 1: Use GPU execution}}$

Maximum throughput for any model deployed with the FIL backend will be substantially higher on GPU than on CPU. In fact, this increased throughput more than offsets the increased per-hour cost of most GPU cloud instances, resulting in cost savings of around 50% for many typical model deployments relative to CPU execution.
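As a minimal sketch (assuming a model named fil_demo in a standard Triton model repository; both the name and the path are placeholders), GPU execution is requested through the instance_group section of config.pbtxt:

```python
# Sketch: request GPU execution for the model. The repository path and model
# name ("fil_demo") are placeholders for your own deployment.
gpu_instance_group = """
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
"""

with open("model_repository/fil_demo/config.pbtxt", "a") as config_file:
    config_file.write(gpu_instance_group)
```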

$\color{#76b900}{\text{Step 2: Use dynamic batching}}$

By combining many small requests from clients into one large server-side batch, we can take maximal advantage of FIL's highly optimized parallel execution.
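A minimal sketch of the corresponding config.pbtxt fragment follows; the queue delay shown is illustrative rather than tuned, since the best value depends on your traffic pattern:

```python
# Sketch: enable dynamic batching. A longer max_queue_delay_microseconds gives
# the server more time to accumulate large batches, at some cost in latency.
dynamic_batching_config = """
dynamic_batching {
  max_queue_delay_microseconds: 50000
}
"""

with open("model_repository/fil_demo/config.pbtxt", "a") as config_file:
    config_file.write(dynamic_batching_config)
```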

$\color{#76b900}{\text{Step 3: Increase client-side batch size (if possible)}}$

If your application allows it, increasing the size of client-side batches can further improve performance by avoiding the overhead of combining input data and scattering output data.
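For example, a client could submit thousands of rows in a single request rather than one row at a time. The sketch below uses the tritonclient HTTP API; the model and tensor names (fil_demo, input__0, output__0) are assumptions that should be checked against your own model configuration, and max_batch_size in config.pbtxt must be at least as large as the batch sent here.

```python
import numpy as np
import tritonclient.http as triton_http

# Sketch: send one large client-side batch instead of many single-row requests.
client = triton_http.InferenceServerClient(url="localhost:8000")

# 4096 rows of 32 features each; replace with your real data and feature count.
batch = np.random.rand(4096, 32).astype(np.float32)

infer_input = triton_http.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# "fil_demo" and "output__0" are placeholder names for this sketch.
result = client.infer("fil_demo", inputs=[infer_input])
predictions = result.as_numpy("output__0")
print(predictions.shape)
```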

$\color{#76b900}{\text{Step 4: Try storage_type DENSE (if possible)}}$

If you have enough memory available, switching to a storage_type of DENSE may improve performance. This is model-dependent, however, and the impact may be small. We will not attempt this here, since you may be working with a model that would not fit in memory with a dense representation.
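If you do want to try it, the switch is a single entry in the model's parameters section; a sketch of the fragment (placed in the same config.pbtxt as above) follows:

```python
# Sketch: request a dense tree representation. This trades extra memory for
# potentially better inference performance; the effect is model-dependent.
dense_storage_param = """
parameters [
  {
    key: "storage_type"
    value: { string_value: "DENSE" }
  }
]
"""
```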

$\color{#76b900}{\text{Step 5 (Advanced): Experiment with algo options}}$

We can use the algo configuration option to explicitly specify how FIL will lay out its trees and progress through them during inference. The following options are available:

- ALGO_AUTO: allow FIL to choose an algorithm automatically
- NAIVE: traverse trees in their original layout
- TREE_REORG: reorganize trees for more coalesced memory access
- BATCH_TREE_REORG: TREE_REORG plus processing of multiple rows per thread block

It is difficult to say a priori which option will be most suitable for your model and deployment configuration, so ALGO_AUTO is generally a safe choice. Nevertheless, we will demonstrate the TREE_REORG option, since it provided about a 10% improvement in throughput for the model used while testing this notebook.

NOTE: Only ALGO_AUTO and NAIVE can be used with storage_type SPARSE or SPARSE8
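A sketch of the corresponding parameters entry is shown below; remember from the note above that TREE_REORG cannot be combined with a SPARSE or SPARSE8 storage_type.

```python
# Sketch: explicitly select the TREE_REORG layout. Whether this helps is
# model-dependent; ALGO_AUTO remains the safe default.
algo_param = """
parameters [
  {
    key: "algo"
    value: { string_value: "TREE_REORG" }
  }
]
"""
```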

$\color{#76b900}{\text{Step 6 (Advanced): Experiment with the blocks_per_sm option}}$

This option lets us explicitly set the number of CUDA blocks per streaming multiprocessor that will be used for the GPU inference kernel. For very large models, this can improve the cache hit rate, resulting in a modest improvement in performance. To experiment with this option, set it to any value between 2 and 7.

As with algo selection, it is difficult to say what the impact of tweaking this option will be. In the model used while testing this notebook, no value offered better performance than the default of 0, which allows FIL to select the blocks per SM via a different method.
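A sketch of the entry follows; the value 4 is purely illustrative, and 0 restores the default behavior.

```python
# Sketch: pin the inference kernel to 4 CUDA blocks per streaming
# multiprocessor. Set to 0 (the default) to let FIL choose for itself.
blocks_per_sm_param = """
parameters [
  {
    key: "blocks_per_sm"
    value: { string_value: "4" }
  }
]
"""
```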

$\color{#76b900}{\text{Step 7 (Advanced): Experiment with the threads_per_tree option}}$

This option lets us increase the number of CUDA threads used for inference on a single tree above 1, but it results in increased shared memory usage. Because of this, we will not experiment with this value in this notebook.
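For reference only (we do not benchmark it here), the setting would look like the following sketch; the value 4 is illustrative.

```python
# Sketch: use 4 CUDA threads per tree instead of the default of 1. Higher
# values increase shared memory usage, so test carefully on your own model.
threads_per_tree_param = """
parameters [
  {
    key: "threads_per_tree"
    value: { string_value: "4" }
  }
]
"""
```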

