Large language models (LLMs) are growing ever more sophisticated, but their expanding capabilities demand ever-greater computational resources. Serving these complex models efficiently and cost-effectively requires a potent combination of cutting-edge hardware and meticulously crafted software. This is precisely where NVIDIA H100 Tensor Core GPUs and TensorRT-LLM software come into play.
The Rise of Mixture-of-Experts (MoE) Architectures
A recent innovation in the realm of LLMs is the MoE architecture. By bringing together the strengths of multiple specialized sub-networks acting as "experts," MoEs offer distinct advantages, including improved accuracy, superior generalization capabilities, and enhanced scalability. Each expert develops its own specialization during training, and when a prompt is processed, a learned router sends each token to the most relevant experts, combining their outputs to deliver an exceptional response.
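To make the routing idea concrete, below is a minimal NumPy sketch of a single MoE layer with eight experts and top-2 gating. The shapes, random weights, and the `moe_layer` helper are illustrative assumptions for this post, not an actual model implementation.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ gate_w                               # (num_tokens, num_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of each token's top_k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                      # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])            # weighted sum of expert outputs
    return out

# Toy usage: 8 "experts" (random linear maps) with top-2 routing.
rng = np.random.default_rng(0)
d_model, num_experts = 16, 8
experts = [(lambda W: (lambda v: v @ W))(0.1 * rng.standard_normal((d_model, d_model)))
           for _ in range(num_experts)]
tokens = rng.standard_normal((4, d_model))            # 4 token activations
gate_w = rng.standard_normal((d_model, num_experts))  # router weights
print(moe_layer(tokens, gate_w, experts).shape)       # -> (4, 16)
```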
One such MoE model is the impressive Mixtral 8x7B, developed by Mistral AI. This post delves into how the NVIDIA H100 GPUs, powered by the NVIDIA Hopper architecture, and TensorRT-LLM software work in concert to deliver phenomenal performance for Mixtral 8x7B.
Optimizing Mixtral 8x7B Performance with NVIDIA H100 and TensorRT-LLM
Cloud service providers serving LLMs at scale often set response time targets. To optimize performance within these constraints, they group user queries into batches. TensorRT-LLM incorporates in-flight batching, a technique that replaces completed requests with new ones during the LLM serving process, further enhancing efficiency.
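The following toy Python simulation illustrates the in-flight batching idea under the simplifying assumption of one generated token per request per decoding step. The `Request` class, batch size, and token counts are hypothetical; this is a sketch of the scheduling concept, not TensorRT-LLM's batch manager.

```python
from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    rid: int
    tokens_left: int                          # output tokens still to generate
    output: list = field(default_factory=list)

def serve(requests, max_batch=8):
    """Toy in-flight batching loop: after every decoding step, finished
    requests leave the batch and queued requests immediately take their
    slots, instead of waiting for the whole batch to complete."""
    queue, active, step = deque(requests), [], 0
    while queue or active:
        while queue and len(active) < max_batch:   # refill free slots
            active.append(queue.popleft())
        for r in active:                           # one decoding step = one token per request
            r.output.append(f"tok{step}")
            r.tokens_left -= 1
        finished = [r for r in active if r.tokens_left == 0]
        active = [r for r in active if r.tokens_left > 0]
        for r in finished:                         # retire completed requests right away
            print(f"step {step:3d}: request {r.rid} done after {len(r.output)} tokens")
        step += 1

random.seed(0)
serve([Request(rid=i, tokens_left=random.randint(5, 40)) for i in range(20)])
```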
Finding the optimal balance between throughput and user interactivity is crucial. Throughput refers to the number of requests processed per second, while user interactivity reflects how responsive the system feels. Fortunately, plots depicting throughput versus latency can be invaluable tools in selecting the ideal deployment scenario. There's often a sweet spot in this curve where significant throughput gains can be achieved with minimal increases in response time. Targeting latency within this zone for production deployments can lead to exceptional user experiences without incurring excessive costs.
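A small sketch of how such a curve might be used in practice follows: given hypothetical (latency, throughput) measurements, it reports the marginal throughput gained per extra second of latency and picks the best operating point under a latency budget. All numbers here are illustrative, not benchmark results.

```python
# Hypothetical (average latency in seconds, throughput in requests/s) points measured
# at increasing batch sizes; the values are illustrative only.
curve = [(0.20, 4.0), (0.30, 9.0), (0.40, 14.0), (0.50, 18.0),
         (0.70, 20.0), (1.00, 21.5), (1.50, 22.0)]

def pick_operating_point(points, latency_budget):
    """Return the highest-throughput point whose latency fits the budget."""
    feasible = [p for p in points if p[0] <= latency_budget]
    return max(feasible, key=lambda p: p[1]) if feasible else None

# Marginal throughput gained per extra second of latency between neighboring points;
# the "sweet spot" is where this ratio collapses.
for (l0, t0), (l1, t1) in zip(curve, curve[1:]):
    print(f"{l0:.2f}s -> {l1:.2f}s: +{(t1 - t0) / (l1 - l0):.1f} req/s per extra second")

print("operating point for a 0.5 s budget:", pick_operating_point(curve, 0.5))
```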
The provided charts illustrate the measured throughput of Mixtral 8x7B on two H100 GPUs running TensorRT-LLM, using both FP16 and FP8 precisions. FP8, a data format supported by TensorRT-LLM software and the NVIDIA Hopper architecture, offers a significant throughput boost: nearly 50% higher throughput for the H100 GPU within a 0.5-second response time limit. This gives developers a choice: either increase throughput at the same response-time target, reducing the cost of serving each request, or hold throughput constant and deliver an even faster response time, enhancing the user experience without a significant cost increase.
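The cost implication can be seen with a quick back-of-the-envelope sketch. Only the roughly 50% relative gain comes from the measurements above; the absolute request rates and the GPU-hour price are hypothetical.

```python
# Illustrative only: assume FP16 sustains 12 req/s within the 0.5 s target and FP8
# sustains ~50% more (the relative gain reported above); the GPU-hour price is hypothetical.
fp16_rps = 12.0
fp8_rps = fp16_rps * 1.5
hourly_cost = 2 * 4.00                       # two H100 GPUs at a hypothetical $4.00/GPU-hour

cost_per_1k_fp16 = hourly_cost / (fp16_rps * 3600) * 1000
cost_per_1k_fp8 = hourly_cost / (fp8_rps * 3600) * 1000

print(f"FP16: {fp16_rps:.1f} req/s -> ${cost_per_1k_fp16:.3f} per 1,000 requests")
print(f"FP8:  {fp8_rps:.1f} req/s -> ${cost_per_1k_fp8:.3f} per 1,000 requests")
print(f"FP8 serves the same traffic at {cost_per_1k_fp8 / cost_per_1k_fp16:.0%} of the FP16 cost")
```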
Pushing the Limits: Streaming Mode and Latency-Unconstrained Scenarios
We also explore performance in streaming mode, where results are reported incrementally as output tokens are produced, rather than waiting for the entire request to complete. This allows us to examine the time taken per output token. Here too, the combination of NVIDIA H100 GPUs and TensorRT-LLM shines. Even with an exceptionally low average time per output token, signifying a very rapid stream of tokens for the user, the system maintains high throughput. For instance, a pair of H100 GPUs running TensorRT-LLM with FP8 precision achieves a throughput of 38.4 requests per second with a mean time per output token of just 0.016 seconds, meaning more than 60 tokens reach each user every second. Once again, FP8 offers a path to either improve responsiveness or serve more users at a given responsiveness level, reducing costs.
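The per-user figures follow directly from the mean time per output token, as the short calculation below shows; the 256-token response length is a hypothetical example.

```python
# Streaming figures from above; the response length is a hypothetical example.
mean_time_per_output_token = 0.016      # seconds (two H100 GPUs, FP8)
throughput_rps = 38.4                   # requests completed per second at that token latency

tokens_per_second_per_user = 1.0 / mean_time_per_output_token
print(f"per-user stream rate: {tokens_per_second_per_user:.1f} tokens/s")    # ~62.5

example_output_len = 256
print(f"a {example_output_len}-token response streams in "
      f"~{example_output_len * mean_time_per_output_token:.1f} s")           # ~4.1 s

print(f"system-wide: {throughput_rps * 60:.0f} requests completed per minute")
```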
Finally, to gauge peak achievable throughput, we examine performance in scenarios without latency constraints. While online scenarios are more common for real-time use cases, offline scenarios like data labeling or sentiment analysis can be valuable benchmarks. The provided table showcases offline throughput at various batch sizes. As the batch size grows, the workload becomes increasingly compute-intensive, further amplifying the benefits of the superior FP8 throughput delivered by the Hopper architecture. Additionally, FP8 reduces memory footprint, enabling the processing of even larger batches. At a batch size of 1,024, the inference throughput reaches a remarkable 21,000 tokens/second with FP8.
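The relationship between batch size and offline throughput can be sketched as follows. The per-step latencies are illustrative assumptions chosen so that the batch-1,024 case lands near the reported ~21,000 tokens/second; only that figure comes from the measurements above.

```python
# Hypothetical per-decoding-step latencies (seconds) at several batch sizes.
step_latency_fp8 = {64: 0.010, 256: 0.021, 1024: 0.049}

for batch_size, step_latency in step_latency_fp8.items():
    # Each decoding step emits one token for every sequence in the batch, so throughput
    # grows with batch size as long as the per-step latency grows sub-linearly.
    tokens_per_second = batch_size / step_latency
    print(f"batch {batch_size:5d}: {tokens_per_second:8.0f} tokens/s")
```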