Retrieving OpenAI Fine-Tuning Results in Python: A Complete Guide with Traceability

Posted: Feb 5, 2025.

Fine-tuning OpenAI models allows developers to customize AI solutions for specific tasks, enhancing performance on specialized datasets. However, analyzing the results and maintaining visibility into the fine-tuning process can be challenging.

In this guide, we'll explore how to retrieve and analyze fine-tuning results using Python, while implementing proper traceability and observability practices.

Prerequisites

Before diving into retrieving results, ensure you have:

  • Prepared your dataset: Your dataset should be in JSONL format. For chat models such as gpt-4o, each line is a JSON object with a 'messages' array of chat turns; the older 'prompt'/'completion' format applies only to legacy completions models, and preference methods like DPO use a related format with preferred and non-preferred outputs.
  • Uploaded your dataset to OpenAI: Use the OpenAI Files API to upload your dataset for fine-tuning (a minimal upload sketch follows this list).
  • Initiated the fine-tuning process: Use the OpenAI API to start the fine-tuning job.
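
For reference, a supervised chat training example looks like this (one JSON object per line):

{"messages": [{"role": "user", "content": "What is JSONL?"}, {"role": "assistant", "content": "JSON Lines: one JSON object per line."}]}

And the upload itself is a single Files API call. A minimal sketch, assuming your key is set in the OPENAI_API_KEY environment variable and the file is named train.jsonl:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file; purpose="fine-tune" tells the API how it will be used
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
print(training_file.id)  # e.g. "file-abc123" -- pass this ID when creating the job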

Retrieving Fine-Tuned Model Results

Once the fine-tuning process is complete, you can retrieve the results using the OpenAI API. Here’s how to do it:

1. Get the Fine-Tune Job ID

After initiating the fine-tuning process, you will receive a job ID. This ID is crucial for tracking the status of your fine-tuning job and retrieving the results.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Create a fine-tune job (this example uses DPO; omit `method` for
# standard supervised fine-tuning)
job = client.fine_tuning.jobs.create(
    training_file="file-WcJytYwxhGJFZBGpJtGytT",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)

# Get the fine-tune job ID, e.g. "ftjob-abc123"
job_id = job.id

2. Check the Fine-Tune Job Status

Use the job ID to check the status of your fine-tuning job. The status moves through validating_files, queued, and running before ending in succeeded, failed, or cancelled.

# Check the fine-tune job status
status_response = client.fine_tuning.jobs.retrieve(job_id)
print(status_response.status)
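
Rather than re-running that check by hand, you can wait for a terminal state in a loop and inspect the job's event log. A minimal sketch (the 60-second interval is an arbitrary choice; for production workloads consider longer intervals or webhooks):

import time

# Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)
print(job.status)

# The event log explains failures and records training progress
events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
for event in events.data:
    print(event.message)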

3. Download the Fine-Tuned Model Results

Once the job has succeeded, you can download the results file to see how the model performed during training:

# Get the results file from your completed fine-tuning job
job = client.fine_tuning.jobs.retrieve(job_id)
result_file_id = job.result_files[0]  # ID of the first (usually only) results file

# Download the contents of the results file as text
# (in recent SDK versions, client.files.content(result_file_id).text replaces
# the deprecated retrieve_content helper)
file_content = client.files.retrieve_content(result_file_id)

The results file is a CSV with one row per training step and these columns:

  • step: the training step number
  • train_loss: loss on the training batch at that step (lower is better)
  • train_accuracy: token-level accuracy on the training batch
  • valid_loss: loss on held-out validation data the model hasn't seen
  • valid_mean_token_accuracy: mean token-level accuracy on the validation data
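
These columns are easy to inspect programmatically. A minimal sketch using the standard library (it assumes file_content holds the CSV text from the previous step; the validation columns are only populated on some rows, so empty values may appear):

import csv
import io

# Parse the results CSV into one dict per training step
reader = csv.DictReader(io.StringIO(file_content))
for row in reader:
    print(row["step"], row["train_loss"], row.get("valid_loss", ""))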

Measuring Fine-Tuning Success with Lunary

Lunary offers essential monitoring capabilities for fine-tuned models, helping track their real-world performance and ROI:

Before Fine-Tuning (model: gpt-4o)

from openai import OpenAI
import lunary

client = OpenAI()

lunary.monitor(client)  # Instrument every call made through this client

chat_completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[{"role": "user", "content": "Hello world"}]
)

After Fine-Tuning (model: your fine-tuned model ID)

from openai import OpenAI
import lunary

client = OpenAI()

lunary.monitor(client)  # Instrument every call made through this client

# The fine-tuned model name comes from the completed job,
# e.g. "ft:gpt-4o-2024-08-06:my-org::abc123"
fine_tuned_model = client.fine_tuning.jobs.retrieve(job_id).fine_tuned_model

chat_completion = client.chat.completions.create(
  model=fine_tuned_model,
  messages=[{"role": "user", "content": "Hello world"}]
)

Key Metrics Tracked:

  • Cost comparison between base and fine-tuned models
  • Response accuracy improvements
  • Token usage optimization
  • Error rates and types
  • User feedback and satisfaction scores

Through Lunary's dashboard, you can validate whether fine-tuning actually improved your model's performance and justify the investment with concrete metrics.
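
To give the dashboard comparable traffic, one simple option is to replay the same prompts through both the base and fine-tuned models using the monitored client. A minimal sketch (the prompt list and the fine-tuned model ID below are placeholders):

from openai import OpenAI
import lunary

client = OpenAI()
lunary.monitor(client)  # Instrument every call made through this client

prompts = ["Hello world", "Summarize our refund policy"]  # placeholder prompts
for model in ("gpt-4o", "ft:gpt-4o-2024-08-06:my-org::abc123"):  # placeholder IDs
    for prompt in prompts:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )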

FAQ

Q: How do I handle the OpenAI API version compatibility issues?

A: Use the new client-based approach (client = OpenAI()) from openai>=1.0, or pin to the legacy SDK with pip install openai==0.28 (not recommended).

Q: How can I optimize my dataset for better fine-tuning results?

A: Aim for 50-100 diverse, high-quality examples with consistent formatting and balanced representation of all behaviors.
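
A quick structural sanity check can also catch malformed lines before you upload. A minimal sketch, assuming the supervised chat format and a file named train.jsonl:

import json

# Verify every line parses as JSON and contains a "messages" list
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

assert all(isinstance(ex.get("messages"), list) for ex in examples)
print(f"{len(examples)} well-formed examples")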

Q: What should I do if my fine-tuned model isn't performing as expected?

A: First analyze the training metrics for overfitting or underfitting (see the sketch below), then review data quality and consider increasing dataset diversity.
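
One way to put that advice into practice is to compare the final training and validation losses from the results CSV. A minimal sketch reusing file_content from earlier (column availability varies by job, so rows without validation metrics are filtered out first):

import csv
import io

rows = list(csv.DictReader(io.StringIO(file_content)))
final_train_loss = float(rows[-1]["train_loss"])
valid_rows = [r for r in rows if r.get("valid_loss")]
if valid_rows:
    final_valid_loss = float(valid_rows[-1]["valid_loss"])
    # A validation loss well above the training loss suggests overfitting
    print(f"train={final_train_loss:.4f}, valid={final_valid_loss:.4f}")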

Q: How do I handle sensitive data in fine-tuning jobs?

A: Never include PII in training data and implement proper encryption, API key rotation, and access monitoring.

Q: What are the cost implications of fine-tuning?

A: Costs vary by model size and epochs; monitor usage carefully and batch training runs when possible to optimize costs.

Conclusion

Fine-tuning models is a journey of continuous improvement. Combining the training metrics with systematic testing and careful tracking of what works will help you make the right decisions.

Use tools like Lunary to maintain visibility into how your LLM behaves in the real world, so you can fine-tune your model to fit your exact needs.
