Loading LangSmith Chat Datasets in LangChain

Posted: Nov 18, 2024.

LangSmith is LangChain's platform for tracking, monitoring and managing LLM applications. When working with chat data in LangSmith, you'll often need to load chat sessions for analysis or fine-tuning purposes. The LangSmithDatasetChatLoader makes this process straightforward.

What is LangSmithDatasetChatLoader?

The LangSmithDatasetChatLoader is a utility class that helps you load chat sessions from LangSmith datasets. It's particularly useful when you need to:

  • Load chat data for model fine-tuning
  • Analyze chat conversations stored in LangSmith
  • Process chat history from your LangSmith experiments

The loader supports both eager loading (loading all data at once) and lazy loading (loading data as needed) approaches.
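To make the trade-off concrete, here is a rough illustration using a plain Python generator as a stand-in for the loader (this is not the real LangSmith API — `lazy_sessions` is a hypothetical stand-in):

```python
def lazy_sessions(n):
    """Yield one (fake) chat session at a time, like lazy_load()."""
    for i in range(n):
        yield {"messages": [f"message {i}"]}  # stand-in for a ChatSession

# Lazy: sessions are produced on demand, so memory use stays constant
# no matter how large the dataset is.
lazy = lazy_sessions(3)
first = next(lazy)  # only one session has been materialized so far

# Eager: everything is pulled into a list up front, like load().
eager = list(lazy_sessions(3))
print(len(eager))  # 3
```

With the real loader, `lazy_load()` behaves like the generator above, while `load()` behaves like the `list(...)` call.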

Reference

| Method | Description |
| --- | --- |
| `__init__(dataset_name: str, client: Optional[Client] = None)` | Initializes the loader with a dataset name and an optional LangSmith client |
| `lazy_load()` | Returns an iterator of chat sessions, loading them one at a time |
| `load()` | Loads all chat sessions at once into memory |

How to Use LangSmithDatasetChatLoader

Basic Setup

First, make sure you have your LangSmith environment properly configured:

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

from langchain_community.chat_loaders.langsmith import LangSmithDatasetChatLoader

Loading Chat Sessions

There are two main ways to load chat sessions:

1. Lazy Loading (Memory Efficient)

Use lazy loading when dealing with large datasets or when you want to process sessions one at a time:

loader = LangSmithDatasetChatLoader(dataset_name="my_chat_dataset")
chat_sessions = loader.lazy_load()

# Process sessions one at a time
# (each session is a ChatSession TypedDict, so access messages by key)
for session in chat_sessions:
    print(f"Processing session with {len(session['messages'])} messages")

2. Eager Loading (All at Once)

When you need all the data at once and memory isn't a concern:

loader = LangSmithDatasetChatLoader(dataset_name="my_chat_dataset")
all_sessions = loader.load()

print(f"Loaded {len(all_sessions)} chat sessions")

Using with Custom LangSmith Client

If you need to use a specific LangSmith client configuration:

from langsmith.client import Client

# Initialize custom client
client = Client(
    api_url="custom_url",
    api_key="your_key"
)

# Use custom client with loader
loader = LangSmithDatasetChatLoader(
    dataset_name="my_dataset",
    client=client
)

Example: Fine-tuning Workflow

Here's a complete example showing how to use the loader in a model fine-tuning workflow:

from langchain_community.chat_loaders.langsmith import LangSmithDatasetChatLoader
from langchain_community.adapters.openai import convert_messages_for_finetuning

# 1. Load the chat sessions
loader = LangSmithDatasetChatLoader(dataset_name="training_dataset")
chat_sessions = loader.lazy_load()

# 2. Convert to training format
training_data = convert_messages_for_finetuning(chat_sessions)

# 3. Prepare for fine-tuning
import json
from io import BytesIO
import openai

# Create training file
my_file = BytesIO()
for dialog in training_data:
    my_file.write((json.dumps({"messages": dialog}) + "\n").encode("utf-8"))
my_file.seek(0)

# Upload and start fine-tuning
training_file = openai.files.create(file=my_file, purpose="fine-tune")
job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo"
)
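Fine-tuning jobs run asynchronously, so you'll typically poll until the job reaches a terminal state before using the resulting model. Here is a minimal polling sketch — `wait_for_job` and its parameters are illustrative helpers, not part of LangChain or the OpenAI SDK, though the retrieve call itself (`openai.fine_tuning.jobs.retrieve`) is a real SDK method:

```python
import time

def wait_for_job(retrieve, job_id, interval=30.0, max_wait=3600.0):
    """Poll a fine-tuning job until it reaches a terminal status.

    `retrieve` is a callable like openai.fine_tuning.jobs.retrieve;
    it is passed in so the helper is easy to test in isolation.
    """
    terminal = {"succeeded", "failed", "cancelled"}
    waited = 0.0
    while waited <= max_wait:
        job = retrieve(job_id)
        if job.status in terminal:
            return job
        time.sleep(interval)
        waited += interval
    raise TimeoutError(f"job {job_id} still running after {max_wait}s")

# Usage with the job created above:
# finished = wait_for_job(openai.fine_tuning.jobs.retrieve, job.id)
# print(finished.status, finished.fine_tuned_model)
```

Once the job's status is `succeeded`, the job object's `fine_tuned_model` field holds the model name you can pass to chat completions.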

The LangSmithDatasetChatLoader simplifies the process of working with chat data stored in LangSmith, making it easier to integrate with other LangChain components and external tools for tasks like model fine-tuning, analysis, or testing.

An alternative to LangSmith

Open-source LangChain monitoring, prompt management, and magic. Get started in 2 minutes.

LangChain Docs
