Preparing Messages for LangChain Fine-tuning with OpenAI

Posted: Nov 7, 2024.

When fine-tuning OpenAI models on chat data in LangChain, your messages must first be converted into the format that OpenAI's fine-tuning API expects. The convert_messages_for_finetuning function handles this transformation for you.

What is convert_messages_for_finetuning?

convert_messages_for_finetuning is a utility function that takes chat sessions and converts them into lists of dictionaries that match OpenAI's expected format for fine-tuning. This function is particularly useful when you want to fine-tune a model on conversational data from various sources like Facebook Messenger, iMessage, or LangSmith datasets.

Reference

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| sessions | Iterable[ChatSession] | The chat sessions to convert. Each session contains messages with sender information and content. |

Returns:

| Type | Description |
| --- | --- |
| List[List[dict]] | A list where each inner list contains dictionaries representing the messages in a format suitable for OpenAI fine-tuning. Each dictionary has 'role' and 'content' keys. |
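To make the returned shape concrete, here is a hand-built sketch of what one converted session looks like. It uses only the standard library, and the message contents are invented for illustration:

```python
import json

# One chat session converted to OpenAI's fine-tuning format:
# a list of {"role", "content"} dicts per conversation.
session = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What time does the store open?"},
    {"role": "assistant", "content": "We open at 9am on weekdays."},
]

# The full training set is a list of such sessions.
training_data = [session]

# Each session becomes one JSONL line when sent to OpenAI.
line = json.dumps({"messages": training_data[0]})
print(line)
```

This is exactly the per-line structure the fine-tuning API consumes, as shown in the JSONL preparation step later in this post.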

How to use convert_messages_for_finetuning

Here are different ways to use this function:

Basic Usage with Chat Sessions

from langchain_community.adapters.openai import convert_messages_for_finetuning
from langchain_core.chat_sessions import ChatSession

# Assuming you have chat sessions loaded
chat_sessions = [ChatSession(messages=[...])]

# Convert messages for fine-tuning
training_data = convert_messages_for_finetuning(chat_sessions)

Using with Message Loader and Pre-processing

Often you'll want to pre-process your messages before converting them for fine-tuning. Here's how to do that with message loaders:

from langchain_community.adapters.openai import convert_messages_for_finetuning
from langchain_community.chat_loaders.utils import map_ai_messages, merge_chat_runs

# Load messages from a source (any chat loader, such as the ones shown below)
raw_messages = loader.lazy_load()

# Merge consecutive messages from same sender
merged_messages = merge_chat_runs(raw_messages)

# Convert specific sender's messages to AI messages
chat_sessions = list(map_ai_messages(merged_messages, sender="AI"))

# Convert to fine-tuning format
training_data = convert_messages_for_finetuning(chat_sessions)
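To see why the merging step matters, here is a minimal standard-library sketch of what merge_chat_runs does conceptually: consecutive messages from the same sender are collapsed into one. The function and sample data below are illustrative, not LangChain's actual implementation:

```python
from itertools import groupby

def merge_runs(messages):
    """Collapse consecutive messages from the same sender into one.

    `messages` is a list of (sender, text) tuples; this mimics what
    merge_chat_runs does to the messages inside a ChatSession.
    """
    merged = []
    for sender, run in groupby(messages, key=lambda m: m[0]):
        # Join a run of same-sender messages into a single message
        merged.append((sender, "\n".join(text for _, text in run)))
    return merged

raw = [
    ("AI", "Hi there!"),
    ("AI", "How can I help?"),      # same sender: merged with the line above
    ("User", "What's the weather?"),
]
merged = merge_runs(raw)
print(merged)
```

Without this step, rapid-fire messages from one participant would each become a separate turn, which makes for noisier training examples.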

Preparing Data for OpenAI Fine-tuning

After converting the messages, you'll typically want to prepare them for OpenAI's fine-tuning API:

import json
from io import BytesIO
from openai import OpenAI

client = OpenAI()

# Create a JSONL file in memory, one {"messages": [...]} object per line
training_file = BytesIO()
for messages in training_data:
    training_file.write((json.dumps({"messages": messages}) + "\n").encode("utf-8"))
training_file.seek(0)

# Upload to OpenAI
file = client.files.create(
    file=training_file,
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)

Using with Different Data Sources

The function works with various chat loaders in LangChain. Here's an example with Facebook Messenger data:

from langchain_community.chat_loaders.facebook_messenger import SingleFileFacebookMessengerChatLoader

# Load Facebook messages
loader = SingleFileFacebookMessengerChatLoader(
    path="messages.json"
)
chat_sessions = loader.load()

# Convert messages
training_data = convert_messages_for_finetuning(chat_sessions)

And with iMessage data:

from langchain_community.chat_loaders.imessage import IMessageChatLoader

# Load iMessage chats
loader = IMessageChatLoader(path="chat.db")
chat_sessions = loader.load()

# Convert messages
training_data = convert_messages_for_finetuning(chat_sessions)

This function is a crucial component in the fine-tuning pipeline, helping you transform your chat data into a format that OpenAI's models can learn from effectively.
