Inside OpenAI’s Prompt Caching: Faster Responses, Half the Cost
Posted: Jan 20, 2025.
OpenAI recently introduced prompt caching across supported models such as GPT-4o, GPT-4.1, o3, and others, aiming to reduce latency and lower compute costs, especially for long, repetitive prompts.
The caching mechanism covers everything you send in a chat, including system messages, user inputs, and assistant replies. It even works with images, as long as they are exactly identical (including settings like resolution). Tool definitions, such as function schemas, are cached too, and fixed templates or structured inputs are remembered and reused automatically.
The need for such optimization became urgent during the recent Ghibli trend, when OpenAI servers were overwhelmed by a flood of similar requests. To avoid a complete meltdown, OpenAI deployed a clever solution of caching the encoded input tokens of similar prompts.
This dramatically cut both latency and costs, allowing them to stabilize the platform under heavy load. What started as an emergency fix has now been rolled out as a standard feature, available for anyone using OpenAI's API models.
How does prompt caching work?
Prompt caching operates by reusing encoded prefix segments of large prompts to avoid redundant computation.
Caching activates automatically for prompts with 1024+ tokens. The system checks the beginning of your prompt (the prefix) and compares it with prefixes that have been processed recently.
- Cache Hit: if a match is found, the cached result is reused, which cuts latency by up to 80% and input costs by 50%.
- Cache Miss: if there is no match, the full prompt is processed and its prefix is stored for future use.
Cached prefixes persist for 5-10 minutes of inactivity, and for up to an hour during off-peak periods.
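If you want to confirm that a request actually hit the cache, the chat completions response exposes a cached-token count in its usage details. Here is a minimal sketch assuming the `openai` Python SDK (v1+); the model name, prompt text, and `ask` helper are illustrative, and `prompt_tokens_details` may be absent on older SDK versions.

```python
# Minimal sketch: send two requests that share a long static prefix and check
# how many prompt tokens were served from the cache on the second call.
# Assumes the `openai` Python SDK (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# A static prefix well past the 1024-token threshold (policies, schemas,
# few-shot examples). The repeated placeholder text stands in for real content.
STATIC_SYSTEM_PROMPT = "You are a meticulous support assistant.\n" + "Reference policy text. " * 400

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any caching-enabled model
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # static part first
            {"role": "user", "content": question},                # dynamic part last
        ],
    )
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens  # 0 on a cache miss
    print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
    return response.choices[0].message.content

ask("How do I reset my password?")   # first call: cache miss, cached_tokens == 0
ask("What is your refund policy?")   # same prefix within a few minutes: cached_tokens >= 1024
```

The second call should report a nonzero cached-token count even though its answer is generated fresh, which is exactly the behavior the pricing example below relies on.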
Prompt caching directly lowers your token spend at scale.
Assume you send a similar prompt twice, one that compares two images.
The usage reported for the second API request is:
Input tokens: 1564
| Model | Input Token Price | Cached Token Price | Total Tokens | Cached Tokens | Uncached Tokens |
|---|---|---|---|---|---|
| GPT-4o Fine-Tuning | $3.750 / 1M | $1.875 / 1M | 1564 | 1024 | 540 |
| GPT-4.1 Nano | $0.100 / 1M | $0.025 / 1M | 1564 | 1024 | 540 |
The cost of two requests without caching would be:
- GPT-4o : 2 * 1564 * 3.75 / 1M = $0.01173
- GPT-4.1 Nano : 2 * 1564 * 0.1 / 1M = $0.0003128
Now with prompt caching, the second request bills the 1024 cached tokens at the discounted cached rate and only the remaining 540 tokens at the full input price:
- GPT-4o : (1564 + 540) * 3.75 / 1M + 1024 * 1.875 / 1M = $0.00981
- GPT-4.1 Nano : (1564 + 540) * 0.1 / 1M + 1024 * 0.025 / 1M = $0.000236
The cached portion of every repeat request is billed at half (or less) of the normal input rate, and the savings grow with each additional request that reuses the same prefix. With GPT-4.1 Nano, the absolute cost comes down even further.
1024 tokens from the prompt were served from cache, saving both time and compute.
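The same arithmetic generalizes to any number of repeat requests. Below is a rough sketch that reproduces the figures above from the table's prices; the per-token prices are copied from the table and may drift as pricing changes, and the model keys are just labels.

```python
# Rough cost sketch: compare N similar requests with and without prompt caching.
# Prices are per 1M input tokens, taken from the table above.
PRICES = {
    "gpt-4o-fine-tuning": {"input": 3.750, "cached": 1.875},
    "gpt-4.1-nano":       {"input": 0.100, "cached": 0.025},
}

def cost(model: str, total_tokens: int, cached_tokens: int, requests: int) -> tuple[float, float]:
    p = PRICES[model]
    # Without caching: every request pays full price for every input token.
    uncached = requests * total_tokens * p["input"] / 1e6
    # With caching: the first request pays full price; later requests pay the
    # cached rate on the shared prefix and full price on the rest.
    cached = (total_tokens * p["input"]
              + (requests - 1) * ((total_tokens - cached_tokens) * p["input"]
                                  + cached_tokens * p["cached"])) / 1e6
    return uncached, cached

for model in PRICES:
    without, with_cache = cost(model, total_tokens=1564, cached_tokens=1024, requests=2)
    print(f"{model}: ${without:.6f} without caching vs ${with_cache:.6f} with caching")
```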
Does the OpenAI API cache responses or just input prompts?
The OpenAI API does not cache responses; it caches only the encoded input prompt prefix, which is then reused. Every response is still generated from scratch, preserving output variability and freshness.
Want to see how many tokens your prompt uses and what it might cost? Try the Lunary OpenAI Tokenizer to break it down and start optimizing.
Can prompt caching be leveraged in fine-tuning or RAG workflows?
Prompt caching is not directly applicable during fine-tuning, but it can be highly beneficial at inference time in workflows like RAG.
During fine-tuning, each training example is processed independently, so no cached context is reused. If you serve a fine-tuned model via the API and your prompts are long and repetitive, caching still applies once they cross the 1024-token threshold, as in the sketch below.
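For a RAG workflow, that means keeping the instructions and output schema in a byte-identical prefix and appending the retrieved documents and user question at the end. A minimal sketch, where the instruction text, `build_messages` helper, and the commented-out `retrieve` call are all placeholders:

```python
# Sketch of a RAG prompt assembled for cache friendliness:
# the static instructions stay identical across requests (cacheable prefix),
# while retrieved context and the user question vary per request (dynamic suffix).
RAG_INSTRUCTIONS = (
    "You answer strictly from the provided context. "
    "Cite the document id for every claim and reply in JSON with keys "
    "'answer' and 'sources'.\n"
    # ...plus schemas / few-shot examples, ideally pushing the prefix past 1024 tokens
)

def build_messages(question: str, retrieved_docs: list[str]) -> list[dict]:
    context = "\n\n".join(f"[doc {i}] {doc}" for i, doc in enumerate(retrieved_docs))
    return [
        {"role": "system", "content": RAG_INSTRUCTIONS},  # static, cache-friendly prefix
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},  # dynamic suffix
    ]

# messages = build_messages("What is the refund window?", retrieve("refund window"))
```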
Developer experience and prompt engineering with caching
Developers should structure prompts so that reusable, static components come first: instructions, schemas, and tool definitions up front, with dynamic user inputs at the end (as in the RAG sketch above).
This improves the cache hit rate and also keeps prompts easier to read and maintain.
With caching active by default, developers can test and refine prompts more efficiently without incurring the full cost of repeated inputs.
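One quick check while iterating is whether your static prefix actually clears the 1024-token threshold. Here is a rough sketch using the `tiktoken` library; the model-to-encoding mapping is an assumption, and chat formatting adds a few tokens of overhead, so treat the count as approximate.

```python
# Approximate token count for a static prompt prefix, to verify it is long
# enough (>= 1024 tokens) to benefit from prompt caching at all.
import tiktoken

def prefix_token_count(static_prefix: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")  # fallback if the model name is unknown
    return len(enc.encode(static_prefix))

STATIC_PREFIX = "System instructions, schemas, few-shot examples ..."  # your real static prefix
count = prefix_token_count(STATIC_PREFIX)
print(f"static prefix: {count} tokens "
      f"({'cacheable' if count >= 1024 else 'below the 1024-token minimum'})")
```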
Conclusion
Instead of making the system re-read the same long instructions over and over, OpenAI remembers the beginning of your prompt if it's long enough (1,024 tokens or more). This means it can skip reprocessing repeated content and go straight to generating the response, saving both time and money.
To get the most out of caching, keep your repeated or fixed content at the start of the prompt, and put unique or changing content at the end.