Using LangChain with vLLM: Complete Tutorial
Posted: Oct 16, 2024.
LangChain offers tools for building complex chains of operations, while vLLM specializes in efficient model inference. Together, they simplify and accelerate the development of intelligent LLM applications.
In this tutorial, we'll cover how to use LangChain with vLLM: everything from setup to distributed inference and quantization.
We will also look into examples, best practices, and tips that will help you get the most out of these tools.
What is vLLM?
vLLM is a fast and easy-to-use library designed for inference and serving large language models. It stands out from other libraries due to several key features that contribute to its high efficiency and scalability:
- State-of-the-art serving throughput: This makes vLLM highly efficient when handling a significant number of requests.
- Efficient attention management with PagedAttention: It effectively manages memory, making sure that attention keys and values are handled in an optimal way.
- Continuous batching of incoming requests: This means more efficient processing, allowing vLLM to handle multiple tasks simultaneously without a drop in performance.
- Optimized CUDA kernels: Leveraging optimized GPU kernels makes the whole process even faster, ensuring that inference is not only accurate but also quick.
These features allow vLLM to provide extremely fast LLM inference while also supporting distributed inference and quantization, making it suitable for a wide range of applications.
Installation and Setup
Keep in mind that vLLM requires:
- Operating System: Linux
- Python Version: Python >= 3.8
- GPU Requirements: A GPU with compute capability >= 7.0 (e.g., V100, T4, RTX20xx, A100, L4, H100).
- CUDA Version: vLLM is compiled with CUDA 12.1. Make sure your system is running this version.
If you are not running CUDA 12.1, you can either install a version of vLLM compiled for your CUDA version or upgrade your CUDA to version 12.1.
Before proceeding, it is recommended to perform some basic checks to ensure everything is installed correctly. You can do this by running the following command to verify that PyTorch is working with CUDA:
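A quick sanity check along these lines (a minimal sketch) confirms that PyTorch detects the GPU and reports the CUDA version it was built against:

```python
# Verify that PyTorch can see the GPU and which CUDA version it was built with.
import torch

print(torch.cuda.is_available())  # should print True on a correctly configured machine
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
```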
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. However, if you need CUDA 11.8, you can use the following command to install a compatible version:
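The exact wheel depends on the vLLM release and your Python version; the version numbers below are placeholders to adapt against the wheels published on the vLLM releases page:

```bash
# Install a vLLM wheel built against CUDA 11.8 (adjust the version numbers to your setup).
export VLLM_VERSION=0.4.0
export PYTHON_VERSION=310
pip install "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl" --extra-index-url https://download.pytorch.org/whl/cu118
```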
Docker Installation
For those facing issues building vLLM or dealing with CUDA compatibility, using the NVIDIA PyTorch Docker image is recommended. It provides a pre-configured environment with the correct versions of CUDA and other dependencies:
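For example, a container from the NGC catalog can be started like this (the image tag is illustrative; pick a recent one):

```bash
# Run the NVIDIA PyTorch container with GPU access and shared-memory settings suitable for vLLM.
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```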
With the environment ready, the integration starts with installing the required packages. We recommend upgrading vLLM to the latest version to avoid compatibility issues and to benefit from the most recent improvements and features.
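A typical install step looks like this (the package set assumes the LangChain community integration is used):

```bash
# Upgrade vLLM and install the LangChain packages used in this tutorial.
pip install --upgrade --quiet vllm langchain langchain-community
```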
Configuring vLLM to work with LangChain
Now that the dependencies are installed, we can set up vLLM and connect it to LangChain. To do this, we will import VLLM from the LangChain community integrations. The example below demonstrates how to initialize a model with the vLLM library and integrate it with LangChain.
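Here is a sketch based on the LangChain community integration; the model choice and sampling values are illustrative:

```python
from langchain_community.llms import VLLM

# Initialize a vLLM-backed LLM inside LangChain.
llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # required for some Hugging Face models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

# The model can now be called like any other LangChain LLM.
print(llm.invoke("What is the capital of France?"))
```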
Here's a list of parameters to keep in mind while using vLLM with LangChain:
| Parameter Name | Description |
| --- | --- |
| `model` | The name or path of a Hugging Face Transformers model to use. |
| `top_k` | Limits the sampling pool to the k most likely tokens. Default is -1 (no restriction). |
| `top_p` | Uses cumulative probability to determine which tokens to consider, supporting more coherent outputs. Default is 1.0. |
| `trust_remote_code` | Allows the model to execute remote code, useful for some Hugging Face models. Default is False. |
| `temperature` | Controls the randomness of sampling, with higher values leading to more diverse outputs. Default is 1.0. |
| `max_new_tokens` | Specifies the maximum number of tokens to generate per output sequence. Default is 512. |
| `callbacks` | Callbacks to add to the run trace, useful for adding logging or monitoring functions during generation. |
| `tags` | Tags to add to the run trace for categorization and easier debugging. |
| `tensor_parallel_size` | Number of GPUs to use for distributed tensor-parallel execution. Default is 1. |
| `use_beam_search` | Whether to use beam search instead of sampling for more optimized sequence generation. Default is False. |
| `vllm_kwargs` | Holds additional parameters valid for the vLLM LLM call that are not explicitly specified. |
In this example, we load the MosaicML MPT-7B model and configure parameters like `max_new_tokens`, `top_k`, and `temperature`. These settings influence how the model generates text.
Creating Chains Using LangChain and vLLM
One of LangChain’s core features is the ability to create chains of operations, allowing more complex interactions. We can easily integrate the vLLM model into an LLMChain, providing even more flexibility.
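A minimal sketch of such a chain, reusing the `llm` object from the earlier example (the prompt and question are illustrative):

```python
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Wrap the vLLM-backed model in an LLMChain so the prompt is applied automatically.
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"
print(llm_chain.invoke({"question": question}))
```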
Such detailed outputs are particularly useful in scenarios where step-by-step reasoning is required, for example, in educational applications, detailed question-answering systems, or automated customer support where users need detailed responses.
Utilizing Multi-GPU Inference for Scaling
If you are working with locally hosted large models, you might want to leverage multiple GPUs for inference, especially in high-throughput systems that need to process many requests simultaneously. vLLM enables exactly this with distributed tensor-parallel inference, which helps scale operations.
To run multi-GPU inference, use the `tensor_parallel_size` parameter when initializing the VLLM class. This approach is highly recommended for larger models like `mosaicml/mpt-30b`, which are computationally intensive and too slow to run on a single GPU.
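A sketch along these lines (the GPU count and model are illustrative and should match your hardware):

```python
from langchain_community.llms import VLLM

# Shard the model across 4 GPUs using tensor parallelism.
llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # required for some Hugging Face models
)

print(llm.invoke("What is the future of AI?"))
```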
Leveraging Quantization for Improved Efficiency
Quantization is an effective technique for improving the performance of language models by reducing memory usage and speeding up computations.
vLLM supports the AWQ quantization format. To enable it, pass the quantization option through the `vllm_kwargs` parameter.
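A minimal sketch, using an AWQ-quantized Llama 2 chat checkpoint (the prompt is illustrative):

```python
from langchain_community.llms import VLLM

# Load an AWQ-quantized model by forwarding the quantization option to vLLM.
llm_awq = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)

print(llm_awq.invoke("What are the benefits of quantizing a language model?"))
```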
Quantization allows for deploying LLMs in resource-constrained environments, such as edge devices or older GPUs, without sacrificing too much accuracy.
In this example, the `TheBloke/Llama-2-7b-Chat-AWQ` model has been quantized for optimal performance.
This feature is particularly valuable when deploying applications to production where cost and resource efficiency are critical.
Conclusion
By utilizing distributed GPU support and advanced quantization techniques, and by maintaining API compatibility, you can create systems that not only deliver exceptional performance but also remain flexible for diverse business needs.
As you continue your journey with large language models using LangChain and vLLM, it's important to remember that continuous optimization and monitoring are key to achieving peak efficiency.
For instance, vLLM's CUDA-optimized kernels and continuous batching strategies can significantly reduce response times.
However, in production systems and especially user-facing ones like chatbots, it’s essential to monitor real-time inference latency.