Using Langchain with Llama.cpp Python: Complete Tutorial

Posted: Nov 5, 2024.

Llama.cpp is a high-performance tool for running language model inference on a wide range of hardware configurations. This capability is further enhanced by the llama-cpp-python bindings, which provide a seamless interface between Llama.cpp and Python.

These bindings allow for both low-level C API access and high-level Python APIs.
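
As a quick illustration of the high-level API, the sketch below loads a GGUF model directly through the llama_cpp.Llama class and runs a single completion (the model path is a placeholder you would replace with your own file):

from llama_cpp import Llama

# Load a local GGUF model (placeholder path; adjust for your setup)
llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048)

# Run a single completion with the high-level API
output = llm("Q: What is a Python binding? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])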

In this tutorial, we will see how to integrate LangChain with Llama.cpp to run large language models on your own hardware without requiring an internet connection.

What is Llama.cpp?

Llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, with optimizations for architectures including Apple silicon, x86, and NVIDIA GPUs.

It supports multiple quantization levels for efficient inference and offers hybrid CPU+GPU execution across several backends.

Through its Python wrapper, llama-cpp-python, Llama.cpp integrates with Python-based tools such as LangChain, making model inference straightforward.

Installing Llama-cpp-python

To use Llama models with LangChain, you need to set up the llama-cpp-python library. Installation options vary depending on your hardware.

| Installation Type | Command | Description |
|---|---|---|
| CPU-Only Installation | pip install llama-cpp-python | Basic setup for CPU-only processing. |
| BLAS Backend Installation | CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python | Faster processing with GPU support. |
| Windows Compilation | Follow the official documentation | Requires Visual Studio, CMake, etc. |
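
To confirm the installation succeeded, one quick check is importing the package and printing its version (a minimal sketch; llama-cpp-python exposes the version string as llama_cpp.__version__):

# Sanity check that llama-cpp-python imports correctly and report its version
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)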

Getting Started with LangChain and Llama.cpp

Begin by installing the required packages:

pip install langchain llama-cpp-python langchain-community

You will need to download the appropriate model file manually and place it at a path you will reference in your code. Please check the list of models supported by llama.cpp.
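
For example, a GGUF file can be fetched programmatically with the huggingface_hub client (a sketch; the repository and file names below are placeholders for whichever model you choose):

from huggingface_hub import hf_hub_download

# Download a GGUF model file from the Hugging Face Hub
# (repo_id and filename are placeholders; substitute the model you want)
model_path = hf_hub_download(
    repo_id="TheBloke/OpenOrca-Platypus2-13B-GGUF",
    filename="openorca-platypus2-13b.Q4_0.gguf",
)
print(model_path)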

New versions of llama-cpp-python now use GGUF model files. This change is significant and may require some updates on your part.

If you need to convert your existing GGML models to GGUF, you can do so using the conversion script provided by llama.cpp:

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 \
  --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin \
  --output models/openorca-platypus2-13b.gguf.q4_0.bin

Creating a Simple LLM Chain

To create an LLM using LangChain and Llama.cpp, use the LlamaCpp class from the langchain_community.llms package:

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

# Define the model path
model_path = "/path/to/your/model/openorca-platypus2-13b.gguf.q4_0.bin"

# Set up callback manager
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Create the LLM object
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.7,
    max_tokens=200,
    top_p=1.0,
    callback_manager=callback_manager,
    verbose=True
)

# Example usage
question = "What is bindings in programming languages?"
response = llm.invoke({"text" : question})
print(response)
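
Building on this, here is a minimal sketch of an actual chain that wires the same LlamaCpp instance behind a prompt template using LCEL (the template wording is illustrative):

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Build a simple prompt -> model -> string-output chain
prompt = PromptTemplate.from_template(
    "Question: {question}\n\nAnswer concisely:"
)
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "What are bindings in programming languages?"}))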

Key Model Parameters

| Parameter | Description |
|---|---|
| temperature | Controls randomness. Lower values produce more deterministic results. |
| max_tokens | Maximum number of tokens to generate. Helps limit response length. |
| top_p | Controls output diversity. Higher values make output more varied. |

Adjust these parameters to fine-tune the model's behavior based on your needs.
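
As an example, a more deterministic setup for factual Q&A might look like the sketch below (the specific values are illustrative starting points, not recommendations):

# A lower-temperature, tighter-sampling configuration for more repeatable answers
deterministic_llm = LlamaCpp(
    model_path=model_path,
    temperature=0.1,  # stay close to the most likely tokens
    top_p=0.9,        # mildly restrict the nucleus of candidate tokens
    max_tokens=128,   # cap response length
    verbose=False,
)

print(deterministic_llm.invoke("Name the capital of France."))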

Running Models on Different Hardware

Using GPU with cuBLAS Backend

If you have an NVIDIA GPU, you can enable the cuBLAS backend for faster processing (note that newer llama.cpp releases replace the LLAMA_CUBLAS flag with GGML_CUDA, so use the flag that matches your installed version):

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python

In Python, configure the LLM to use GPU:

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=-1,   # offload all model layers to the GPU
    n_batch=512,       # number of tokens processed in parallel per batch
    callback_manager=callback_manager,
    verbose=True
)

Running on Apple Silicon (Metal Backend)

For Mac users on Apple silicon, Llama.cpp supports the Metal backend for GPU acceleration:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

In your Python code, enable the Metal-specific settings:

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=1,    # offloading a single layer is enough to enable Metal
    f16_kv=True,       # use half-precision for the key/value cache
    callback_manager=callback_manager,
    verbose=True
)

Considerations for Choosing Models

Model Size and Compatibility

Choose models based on your hardware capabilities. Larger models require more resources. Quantized models can help if you have limited hardware, but they may sacrifice some accuracy.

| Model Variant | Memory Requirement | Use Case |
|---|---|---|
| LLaMA 7B | Moderate | Basic Q&A and small applications. |
| LLaMA 13B | High | More detailed responses. |
| LLaMA 30B | Very High | Complex tasks, best with a GPU. |

Power Consumption

Consider the power requirements of the model if you're running it on battery-powered devices or edge hardware. Smaller models typically have lower power demands.

Model Updates

Larger models may be updated less frequently, so you may need to balance the benefits of the latest model against compatibility and retraining efforts.

Troubleshooting and Common Issues

| Issue | Solution |
|---|---|
| CUDA/BLAS Issues | Ensure proper installation of GPU drivers and CUDA toolkit. |
| Model Load Errors | Verify model path and format compatibility (usually GGUF). |
| Slow Performance | Adjust n_gpu_layers or n_batch for better efficiency. |
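
For model load errors specifically, a quick pre-flight check of the model path can save a confusing traceback; below is a minimal sketch using only the standard library:

from pathlib import Path

# Verify the model file exists before handing it to LlamaCpp
model_file = Path(model_path)
if not model_file.is_file():
    raise FileNotFoundError(f"Model file not found: {model_file}")
if ".gguf" not in model_file.name.lower():
    print("Warning: file name does not mention GGUF; it may need conversion first.")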

Experiment with different configurations and models to find what works best for your needs.
