
Caching

Cache LLM Responses

Quick Start

Enable caching by adding the cache key under litellm_settings in config.yaml.

Step 1: Add cache to the config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: text-embedding-ada-002
    litellm_params:
      model: text-embedding-ada-002

litellm_settings:
  set_verbose: True
  cache: True # set cache responses to True, litellm defaults to using a redis cache

Step 2: Add Redis Credentials to .env

To enable caching, set either REDIS_URL or REDIS_HOST (together with REDIS_PORT and REDIS_PASSWORD) in your OS environment.

REDIS_URL = ""        # REDIS_URL='redis://username:password@hostname:port/database'
## OR ##
REDIS_HOST = "" # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'

Additional kwargs
You can pass any additional redis.Redis argument by storing its name and value in your OS environment, like this:

REDIS_<redis-kwarg-name> = ""

See how it's read from the environment
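
As a rough illustration of the REDIS_<redis-kwarg-name> convention, the sketch below collects every REDIS_-prefixed variable into a keyword-argument dictionary. The helper name redis_kwargs_from_env and the example variable REDIS_SSL are assumptions made for this example; this is not LiteLLM's actual implementation, which may convert types or handle names differently.

```python
import os
from typing import Mapping, Optional


def redis_kwargs_from_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect REDIS_<NAME> variables into a {name: value} dict (illustrative only)."""
    environ = os.environ if environ is None else environ
    prefix = "REDIS_"
    kwargs = {}
    for key, value in environ.items():
        # Skip unrelated variables and ones that were left empty.
        if not key.startswith(prefix) or value == "":
            continue
        # REDIS_HOST -> "host", REDIS_PORT -> "port", and so on.
        kwargs[key[len(prefix):].lower()] = value
    return kwargs


# Hypothetical environment, passed in explicitly so the example is reproducible.
example_env = {
    "REDIS_HOST": "localhost",
    "REDIS_PORT": "6379",
    "REDIS_SSL": "True",  # assumed example of an extra redis.Redis kwarg
    "PATH": "/usr/bin",
}
print(redis_kwargs_from_env(example_env))
```

Note that every value stays a string in this sketch; a real client would still need to convert values such as the port to the types redis.Redis expects.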

Step 3: Run proxy with config

$ litellm --config /path/to/config.yaml

Using Caching - /chat/completions

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'
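
The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper names (build_chat_request, timed_call, send_twice) are made up for this example, and it assumes the proxy is listening on the address used in the curl commands. With caching enabled, one would expect the second call to return noticeably faster than the first, though the exact timings depend on your setup.

```python
import json
import time
import urllib.request

PROXY_BASE_URL = "http://0.0.0.0:8000"  # same address as the curl examples


def build_chat_request(prompt: str, base_url: str = PROXY_BASE_URL) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_call(request: urllib.request.Request):
    """Send the request and return (elapsed_seconds, parsed_json_body)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return time.perf_counter() - start, body


def send_twice(prompt: str = "write a poem about litellm!"):
    """Send an identical request twice and return both elapsed times."""
    first_seconds, _ = timed_call(build_chat_request(prompt))
    second_seconds, _ = timed_call(build_chat_request(prompt))
    return first_seconds, second_seconds
```

Calling send_twice() requires a running proxy; nothing is sent when the functions are merely defined.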

Using Caching - /embeddings

Send the same request twice:

curl --location 'http://0.0.0.0:8000/embeddings' \
--header 'Content-Type: application/json' \
--data ' {
"model": "text-embedding-ada-002",
"input": ["write a litellm poem"]
}'

curl --location 'http://0.0.0.0:8000/embeddings' \
--header 'Content-Type: application/json' \
--data ' {
"model": "text-embedding-ada-002",
"input": ["write a litellm poem"]
}'

Advanced

Set Cache Params on config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: text-embedding-ada-002
    litellm_params:
      model: text-embedding-ada-002

litellm_settings:
  set_verbose: True
  cache: True # set cache responses to True, litellm defaults to using a redis cache
  cache_params: # cache_params are optional
    type: "redis" # The type of cache to initialize. Can be "local" or "redis". Defaults to "local".
    host: "localhost" # The host address for the Redis cache. Required if type is "redis".
    port: 6379 # The port number for the Redis cache. Required if type is "redis".
    password: "your_password" # The password for the Redis cache. Required if type is "redis".

    # Optional configurations
    supported_call_types: ["acompletion", "completion", "embedding", "aembedding"] # defaults to all litellm call types
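
To make the "required if type is redis" comments concrete, here is a small sketch of a pre-flight check on a cache_params dictionary. The function missing_cache_params is a hypothetical helper, not part of LiteLLM, and its rules (including the "local" default) are taken only from the comments in the config above.

```python
REQUIRED_REDIS_KEYS = ("host", "port", "password")


def missing_cache_params(cache_params: dict) -> list:
    """Return the keys the config comments mark as required but that are absent."""
    # Default taken from the comment on the `type` line in the config above.
    cache_type = cache_params.get("type", "local")
    if cache_type not in ("local", "redis"):
        raise ValueError(f"unsupported cache type: {cache_type!r}")
    if cache_type == "local":
        return []  # no connection details are listed as required for a local cache
    return [key for key in REQUIRED_REDIS_KEYS if key not in cache_params]


# Mirrors the cache_params block above, with the password left out on purpose.
params = {"type": "redis", "host": "localhost", "port": 6379}
print(missing_cache_params(params))
```

A check like this could run before starting the proxy so that a missing Redis setting is reported early rather than at the first cached request.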

Cache-Controls on requests

Set a TTL per request by passing a Cache-Control header. The proxy currently supports only the s-maxage directive.

Comment on this issue if you need additional cache controls - https://github.com/BerriAI/litellm/issues/1218

const { OpenAI } = require('openai');

const openai = new OpenAI({
  apiKey: "sk-1234",              // API key sent to the proxy
  baseURL: "http://0.0.0.0:8000"  // point the client at the proxy
});

async function main() {
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: 'user', content: 'Say this is a test' }],
    model: 'gpt-3.5-turbo',
  }, {
    headers: {
      "Cache-Control": "s-maxage=0" // 👈 sets ttl=0
    }
  });
  console.log(chatCompletion.choices[0].message.content);
}

main();
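
For clients that do not use the OpenAI SDK, the header can be attached by hand. The sketch below builds an equivalent request with the Python standard library; the helper names cache_control_headers and build_request_with_ttl are assumptions for this example, and the URL and key simply mirror the values in the snippet above.

```python
import json
import urllib.request


def cache_control_headers(ttl_seconds: int) -> dict:
    """Return a Cache-Control header carrying the only supported directive, s-maxage."""
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be >= 0")
    return {"Cache-Control": f"s-maxage={ttl_seconds}"}


def build_request_with_ttl(
    ttl_seconds: int,
    base_url: str = "http://0.0.0.0:8000",
    api_key: str = "sk-1234",
) -> urllib.request.Request:
    """Build a chat completion request whose TTL is set through Cache-Control."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Say this is a test"}],
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # how the OpenAI SDK sends apiKey
    }
    headers.update(cache_control_headers(ttl_seconds))
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Build (but do not send) a request asking for a 10-minute TTL.
request = build_request_with_ttl(600)
print(request.get_header("Cache-control"))
```

Passing the request to urllib.request.urlopen would send it to a running proxy; building it has no side effects.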

Override caching per chat/completions request

Caching can be switched on or off for each /chat/completions request.

  • To turn caching on for an individual completion, pass "caching": true:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": true
    }'
  • To turn caching off for an individual completion, pass "caching": false:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": false
    }'

Override caching per /embeddings request

Caching can be switched on or off for each /embeddings request.

  • To turn caching on for an individual embedding request, pass "caching": true:
    curl --location 'http://0.0.0.0:8000/embeddings' \
    --header 'Content-Type: application/json' \
    --data ' {
    "model": "text-embedding-ada-002",
    "input": ["write a litellm poem"],
    "caching": true
    }'
  • To turn caching off for an individual embedding request, pass "caching": false:
    curl --location 'http://0.0.0.0:8000/embeddings' \
    --header 'Content-Type: application/json' \
    --data ' {
    "model": "text-embedding-ada-002",
    "input": ["write a litellm poem"],
    "caching": false
    }'
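
When these requests are built in code rather than with curl, the per-request override is just one more field in the JSON body. The sketch below uses a hypothetical helper, with_caching, to add the flag to either a chat completion or an embedding payload; it also shows that Python's True and False serialize to the lowercase true and false used in the curl examples.

```python
import json


def with_caching(payload: dict, enabled: bool) -> str:
    """Return the JSON body for `payload` with the per-request caching flag set."""
    body = dict(payload)  # copy so the caller's dict is not modified
    body["caching"] = enabled
    return json.dumps(body)


chat_body = with_caching(
    {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "write a poem about litellm!"}],
        "temperature": 0.7,
    },
    enabled=False,  # skip the cache for this completion
)

embedding_body = with_caching(
    {"model": "text-embedding-ada-002", "input": ["write a litellm poem"]},
    enabled=True,  # use the cache for this embedding request
)

print(chat_body)
print(embedding_body)
```

The resulting strings can be sent as the request body to /v1/chat/completions and /embeddings respectively.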