
Using Langchain with vLLM: Complete Tutorial

Oct 16, 2024.

Langchain offers tools for building complex chains of operations, while vLLM specializes in efficient model inference. Together, they simplify and accelerate the development of intelligent LLM applications.

In this tutorial, we'll cover how to use Langchain with vLLM: everything from setup to distributed inference and quantization.

We will also look into examples, best practices, and tips that will help you get the most out of these tools.

What is vLLM?

vLLM is a fast and easy-to-use library designed for inference and serving large language models. It stands out from other libraries due to several key features that contribute to its high efficiency and scalability:

  • State-of-the-art serving throughput: This makes vLLM highly efficient when handling a significant number of requests.

  • Efficient attention management with PagedAttention: It effectively manages memory, making sure that keys and values in attention are handled in an optimal way.

  • Continuous batching of incoming requests: This means more efficient processing, allowing vLLM to handle multiple tasks simultaneously without a drop in performance.

  • Optimized CUDA kernels: Leveraging optimized GPU kernels makes the whole process even faster, ensuring that inference is not only accurate but also quick.

These features allow vLLM to provide extremely fast LLM inference. It also supports distributed inference and quantization, making it suitable for a wide range of applications.

Installation and setup

Keep in mind that vLLM requires:

  • Operating System: Linux

  • Python Version: Python >= 3.8

  • GPU Requirements: A GPU with compute capability >= 7.0 (e.g., V100, T4, RTX20xx, A100, L4, H100).

  • CUDA Version: vLLM is compiled with CUDA 12.1. Make sure your system is running this version.

If you are not running CUDA 12.1, you can either install a version of vLLM compiled for your CUDA version or upgrade your CUDA to version 12.1.

Before proceeding, it is recommended to perform some basic checks to ensure everything is installed correctly. You can do this by running the following command to verify that PyTorch is working with CUDA:

# Ensure torch is working with CUDA, this should print: True
python -c 'import torch; print(torch.cuda.is_available())'
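You can also confirm from Python which CUDA version your PyTorch build targets; a minimal sketch (the printed values are illustrative):

import torch

# The CUDA version PyTorch was built against; it should match the CUDA
# version your vLLM wheel expects (e.g. 12.1)
print(torch.version.cuda)
print(torch.cuda.is_available())  # True means a compatible GPU and driver were found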

vLLM is a Python library that ships with pre-compiled C++ and CUDA (12.1) binaries. However, if you need CUDA 11.8, you can use the following commands to install a compatible version:

# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

Docker Installation

For those facing issues building vLLM or dealing with CUDA compatibility, using the NVIDIA PyTorch Docker image is recommended. It provides a pre-configured environment with the correct versions of CUDA and other dependencies:

# Use `--ipc=host` to ensure the shared memory is sufficient
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

With the environment ready, the integration starts with installing the required packages. We recommend upgrading vLLM to the latest version to avoid compatibility issues and benefit from the most recent improvements and features.

pip install --upgrade -q vllm
pip install -q langchain langchain_community
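A quick import check helps confirm the installation before moving on (assuming both packages expose a __version__ attribute, as recent releases do):

import vllm
import langchain

# Print the installed versions to confirm the environment is set up
print("vllm:", vllm.__version__)
print("langchain:", langchain.__version__)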

Configuring vLLM to Work with Langchain

Now that the dependencies are installed, we can set up vLLM and connect it to Langchain. To do this, we will import VLLM from the Langchain community integrations. The example below demonstrates how to initialize a model with the vLLM library and integrate it with Langchain.

from langchain_community.llms import VLLM

# Initializing the vLLM model
llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # required for models that ship custom code on Hugging Face
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

# Running a simple query
print(llm.invoke("What are the most popular Halloween Costumes?"))

Here's a list of parameters to keep in mind while using vLLM with Langchain:

  • model: The name or path of a Hugging Face Transformers model to use.

  • top_k: Limits the sampling pool to the top k tokens, improving diversity. Default is -1.

  • top_p: Uses cumulative probability to determine which tokens to consider, supporting more coherent outputs. Default is 1.0.

  • trust_remote_code: Allows the model to execute remote code, useful for some Hugging Face models. Default is False.

  • temperature: Controls the randomness of sampling, with higher values leading to more diverse outputs. Default is 1.0.

  • max_new_tokens: Specifies the maximum number of tokens to generate per output sequence. Default is 512.

  • callbacks: Callbacks to add to the run trace, useful for adding logging or monitoring functions during generation.

  • tags: Tags to add to the run trace for categorization and easier debugging.

  • tensor_parallel_size: Number of GPUs to use for distributed tensor-parallel execution. Default is 1.

  • use_beam_search: Whether to use beam search instead of sampling for more optimized sequence generation. Default is False.

  • vllm_kwargs: Holds additional parameters valid for the vLLM LLM call that are not explicitly specified.

In this example, we load the MosaicML MPT-7B model and configure parameters like max_new_tokens, top_k, and temperature. These settings influence how the model generates text.
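Beyond these sampling settings, engine-level options that the wrapper does not expose directly can be forwarded through vllm_kwargs. The sketch below assumes gpu_memory_utilization and max_model_len are valid engine arguments in your vLLM version (they are in recent releases):

from langchain_community.llms import VLLM

# Forwarding extra engine arguments through vllm_kwargs
llm_tuned = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,
    max_new_tokens=128,
    tags=["vllm-tutorial"],  # tags attached to the run trace
    vllm_kwargs={
        "gpu_memory_utilization": 0.9,  # fraction of GPU memory vLLM may allocate
        "max_model_len": 2048,          # cap on the context length
    },
)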

Creating Workflow Chains Using Langchain and vLLM

One of Langchain’s core features is the ability to create chains of operations, allowing more complex interactions. We can easily integrate the vLLM model into an LLMChain, providing even more flexibility.

from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

# Defining a prompt template for our LLMChain
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Creating an LLMChain with vLLM
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Testing the LLMChain
question = "Who was the US president in the year the first Pokemon game was released?"
print(llm_chain.invoke(question))

Such detailed outputs are particularly useful in scenarios where step-by-step reasoning is required, for example, in educational applications, detailed question-answering systems, or automated customer support where users need detailed responses.
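As a side note, recent Langchain releases favor the LangChain Expression Language (LCEL) over LLMChain. The same chain can be expressed with the pipe syntax, reusing the prompt and llm objects defined above:

# Equivalent chain built with LCEL's pipe syntax
lcel_chain = prompt | llm

print(lcel_chain.invoke({"question": question}))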

Utilizing Multi-GPU Inference for Scaling

If you are working with locally hosted large models, you might want to leverage multiple GPUs for inference, especially in high-throughput systems that need to process many requests simultaneously. vLLM supports exactly that: distributed tensor-parallel inference, which helps scale operations across GPUs.

To run multi-GPU inference, use the tensor_parallel_size parameter while initializing the VLLM class.

from langchain_community.llms import VLLM

# Running inference on multiple GPUs
llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,  # using 4 GPUs
    trust_remote_code=True,
)

print(llm.invoke("What is the future of AI?"))

This method is highly recommended for larger models like mosaicml/mpt-30b, which can be computationally intensive and too slow to run on a single GPU.
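Before picking a value for tensor_parallel_size, it can help to confirm how many GPUs PyTorch actually sees; a quick check:

import torch

# tensor_parallel_size should not exceed the number of visible GPUs
print("Visible GPUs:", torch.cuda.device_count())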

Leveraging Quantization for Improved Efficiency

Quantization is an effective technique for improving the performance of language models by reducing memory usage and speeding up computations.

vLLM supports the AWQ quantization format. To enable it, pass the quantization option through the vllm_kwargs parameter. Quantization allows for deploying LLMs in resource-constrained environments, such as edge devices or older GPUs, without sacrificing too much accuracy.

llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)

In this example, the TheBloke/Llama-2-7b-Chat-AWQ model has been quantized for optimal performance. This feature is particularly valuable when deploying applications to production where cost and resource efficiency are critical.
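Querying the quantized model works exactly like the earlier examples; for instance:

# The quantized model is used the same way as any other VLLM instance
print(llm_q.invoke("Summarize the benefits of AWQ quantization in one sentence."))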

Conclusion

By leveraging distributed GPU support and quantization while keeping your code within Langchain's standard interfaces, you can create systems that not only deliver exceptional performance but also remain flexible for diverse business needs.

As you continue your journey with large language models using Langchain and vLLM, it's important to remember that continuous optimization and monitoring are key to achieving peak efficiency.

For instance, vLLM's CUDA-optimized kernels and continuous batching strategies can significantly reduce response times.

However, in production systems and especially user-facing ones like chatbots, it’s essential to monitor real-time inference latency.
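A minimal sketch of such a latency check, timing a single request against one of the llm objects created earlier (illustrative only, not a full monitoring setup):

import time

start = time.perf_counter()
response = llm.invoke("Give a one-sentence status update.")
elapsed = time.perf_counter() - start

print(response)
print(f"Inference latency: {elapsed:.2f}s")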
