Semantic Caching for LLMs

RedisVL provides a SemanticCache interface that uses Redis's built-in caching capabilities and vector search to store responses to previously answered questions. This reduces the number of requests and tokens sent to LLM services, decreasing costs and improving application throughput by cutting the time it takes to generate a response.

This notebook will go over how to use Redis as a semantic cache for your applications.

First, we will import OpenAI to use their API for responding to user prompts. We will also create a simple ask_openai helper method to assist.

import os
import getpass
import time
import numpy as np

from openai import OpenAI


os.environ["TOKENIZERS_PARALLELISM"] = "False"

api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI(api_key=api_key)

def ask_openai(question: str) -> str:
    response = client.completions.create(
      model="gpt-3.5-turbo-instruct",
      prompt=question,
      max_tokens=200
    )
    return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))
The capital of France is Paris.
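Note: gpt-3.5-turbo-instruct goes through OpenAI's legacy completions endpoint. If you would rather use a chat model, a minimal sketch of an equivalent helper might look like the following (the model name here is only a placeholder -- use whichever chat model your account has access to):

def ask_openai_chat(question: str) -> str:
    # Same idea as ask_openai, but via the chat completions endpoint
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        max_tokens=200,
    )
    return response.choices[0].message.content.strip()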

Initializing SemanticCache

SemanticCache will automatically create an index within Redis upon initialization for the semantic cache content.

from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",                     # underlying search index name
    redis_url="redis://localhost:6379",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)
22:11:38 redisvl.index.index INFO   Index already exists, not overwriting.
# look at the index specification created for the semantic cache lookup
!rvl index info -i llmcache
Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes     │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ llmcache     │ HASH           │ ['llmcache'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name          │ Attribute     │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├───────────────┼───────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────────────────┼────────────────┤
│ prompt        │ prompt        │ TEXT    │ WEIGHT         │ 1              │                │                │                │                │                 │                │
│ response      │ response      │ TEXT    │ WEIGHT         │ 1              │                │                │                │                │                 │                │
│ inserted_at   │ inserted_at   │ NUMERIC │                │                │                │                │                │                │                 │                │
│ updated_at    │ updated_at    │ NUMERIC │                │                │                │                │                │                │                 │                │
│ prompt_vector │ prompt_vector │ VECTOR  │ algorithm      │ FLAT           │ data_type      │ FLOAT32        │ dim            │            768 │ distance_metric │ COSINE         │
╰───────────────┴───────────────┴─────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴─────────────────┴────────────────╯
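The prompt_vector field above holds a 768-dimensional embedding produced by the cache's default sentence-transformer vectorizer. If you want control over which embedding model is used, SemanticCache also accepts a vectorizer at construction time. A minimal sketch, assuming redisvl's HFTextVectorizer (which requires the sentence-transformers package) and an example model checkpoint:

from redisvl.utils.vectorize import HFTextVectorizer

# Sketch: build a cache with an explicit embedding model instead of the default
custom_cache = SemanticCache(
    name="llmcache_custom",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
    vectorizer=HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2"),
)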

Basic Cache Usage

question = "What is the capital of France?"
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
    print(response)
else:
    print("Empty cache")
Empty cache

Our initial cache check should be empty since we have not yet stored anything in the cache. Below, store the question, the corresponding response, and any arbitrary metadata (as a Python dictionary object) in the cache.

# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Paris",
    metadata={"city": "Paris", "country": "france"}
)
'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'

Now we will check the cache again with the same question and with a semantically similar question:

# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
    print(response)
else:
    print("Empty cache")
[{'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}, 'key': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'}]
# Check for a semantically similar result
question = "What actually is the capital of France?"
llmcache.check(prompt=question)[0]['response']
'Paris'

Customize the Distance Threshold

For most use cases, the right semantic similarity threshold is not a fixed quantity. Depending on the choice of embedding model, the properties of the input query, and even the business use case, the threshold might need to change.

Fortunately, you can seamlessly adjust the threshold at any point, as shown below:

# Widen the semantic distance threshold
llmcache.set_threshold(0.3)
# Really try to trick it by asking around the point
# But is able to slip just under our new threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
llmcache.check(prompt=question)[0]['response']
'Paris'
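The threshold cuts both ways: tightening it back down should make the same roundabout question miss the cache again. A quick sketch to confirm:

# Tighten the threshold back down -- the roundabout phrasing should now miss
llmcache.set_threshold(0.1)
print(llmcache.check(prompt=question) or "Empty cache")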
# Invalidate the cache completely by clearing it out
llmcache.clear()

# should be empty now
llmcache.check(prompt=question)
[]

Utilizing TTL

Redis uses optional TTL policies to expire individual keys at some point in the future. This lets you focus on your data flow and business logic without worrying about complex cleanup tasks.

A TTL policy set on the SemanticCache allows you to temporarily hold on to cache entries. Below, we set the TTL policy to 5 seconds.

llmcache.set_ttl(5) # 5 seconds
llmcache.store("This is a TTL test", "This is a TTL test response")

time.sleep(6)
# confirm that the cache has cleared by now on its own
result = llmcache.check("This is a TTL test")

print(result)
[]
# Reset the TTL to null (long lived data)
llmcache.set_ttl()

Simple Performance Testing

Next, we will measure the speedup obtained by using SemanticCache. We will use the time module to measure the time taken to generate responses with and without SemanticCache.

def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI with a wrapper
    check for the answer in the semantic cache first.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()

print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")

# add the entry to our LLM cache
llmcache.store(prompt=question, response="George Washington")
Without caching, a call to openAI to answer this simple question took 0.9034533500671387 seconds.
'llmcache:67e0f6e28fe2a61c0022fd42bf734bb8ffe49d3e375fd69d692574295a20fc1a'
# Calculate the avg latency for caching over LLM usage
times = []

for _ in range(10):
    cached_start = time.time()
    cached_answer = answer_question(question)
    cached_end = time.time()
    times.append(cached_end-cached_start)

avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")
Avg time taken with LLM cache enabled: 0.09753389358520508
Percentage of time saved: 89.2%
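In practice you usually would not populate the cache by hand as we did above. A common pattern is to write the fresh LLM answer back into the cache on a miss, so each distinct question only pays the OpenAI latency once. A minimal sketch built from the helpers already defined in this notebook:

def answer_question_with_writeback(question: str) -> str:
    """Answer from the semantic cache when possible; otherwise ask OpenAI
    and store the fresh answer so the next similar question is a cache hit."""
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    answer = ask_openai(question)
    llmcache.store(prompt=question, response=answer)
    return answer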
# check the stats of the index
!rvl stats -i llmcache
Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key                    │ Value       │
├─────────────────────────────┼─────────────┤
│ num_docs                    │ 1           │
│ num_terms                   │ 19          │
│ max_doc_id                  │ 6           │
│ num_records                 │ 53          │
│ percent_indexed             │ 1           │
│ hash_indexing_failures      │ 0           │
│ number_of_uses              │ 45          │
│ bytes_per_record_avg        │ 45.0566     │
│ doc_table_size_mb           │ 0.000134468 │
│ inverted_sz_mb              │ 0.00227737  │
│ key_table_size_mb           │ 2.76566e-05 │
│ offset_bits_per_record_avg  │ 8           │
│ offset_vectors_sz_mb        │ 3.91006e-05 │
│ offsets_per_term_avg        │ 0.773585    │
│ records_per_doc_avg         │ 53          │
│ sortable_values_size_mb     │ 0           │
│ total_indexing_time         │ 19.454      │
│ total_inverted_index_blocks │ 21          │
│ vector_index_sz_mb          │ 3.0161      │
╰─────────────────────────────┴─────────────╯
# Clear the cache AND delete the underlying index
llmcache.delete()

Cache Access Controls, Tags & Filters

When running complex workflows with similar applications, or handling multiple users, it's important to keep data segregated. Building on RedisVL's support for complex and hybrid queries, we can tag and filter cache entries using custom filterable_fields.

Let's store multiple users' data in our cache, with similar prompts, and ensure that only the correct user's information is returned:

private_cache = SemanticCache(
    name="private_cache",
    filterable_fields=[{"name": "user_id", "type": "tag"}]
)

private_cache.store(
    prompt="What is the phone number linked to my account?",
    response="The number on file is 123-555-0000",
    filters={"user_id": "abc"},
)

private_cache.store(
    prompt="What's the phone number linked in my account?",
    response="The number on file is 123-555-1111",
    filters={"user_id": "def"},
)
'private_cache:5de9d651f802d9cc3f62b034ced3466bf886a542ce43fe1c2b4181726665bf9c'
from redisvl.query.filter import Tag

# define user id filter
user_id_filter = Tag("user_id") == "abc"

response = private_cache.check(
    prompt="What is the phone number linked to my account?",
    filter_expression=user_id_filter,
    num_results=2
)

print(f"found {len(response)} entry \n{response[0]['response']}")
found 1 entry 
The number on file is 123-555-0000
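Swapping in a tag filter for the other user should return that user's entry instead, even though the prompts are nearly identical:

# Same prompt, different user -- the tag filter keeps results segregated
other_user_filter = Tag("user_id") == "def"
response = private_cache.check(
    prompt="What is the phone number linked to my account?",
    filter_expression=other_user_filter,
)
print(response[0]["response"])  # expected: The number on file is 123-555-1111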
# Cleanup
private_cache.delete()

Multiple filterable_fields can be defined on a cache, and complex filter expressions can be constructed to filter on these fields, as well as the default fields already present.


complex_cache = SemanticCache(
    name='account_data',
    filterable_fields=[
        {"name": "user_id", "type": "tag"},
        {"name": "account_type", "type": "tag"},
        {"name": "account_balance", "type": "numeric"},
        {"name": "transaction_amount", "type": "numeric"}
    ]
)
complex_cache.store(
    prompt="what is my most recent checking account transaction under $100?",
    response="Your most recent transaction was for $75",
    filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 75},
)
complex_cache.store(
    prompt="what is my most recent savings account transaction?",
    response="Your most recent deposit was for $300",
    filters={"user_id": "abc", "account_type": "savings", "transaction_amount": 300},
)
complex_cache.store(
    prompt="what is my most recent checking account transaction over $200?",
    response="Your most recent transaction was for $350",
    filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 350},
)
complex_cache.store(
    prompt="what is my checking account balance?",
    response="Your current checking account is $1850",
    filters={"user_id": "abc", "account_type": "checking"},
)
'account_data:d48ebb3a2efbdbc17930a8c7559c548a58b562b2572ef0be28f0bb4ece2382e1'
from redisvl.query.filter import Num

value_filter = Num("transaction_amount") > 100
account_filter = Tag("account_type") == "checking"
complex_filter = value_filter & account_filter

# check for checking account transactions over $100
complex_cache.set_threshold(0.3)
response = complex_cache.check(
    prompt="what is my most recent checking account transaction?",
    filter_expression=complex_filter,
    num_results=5
)
print(f'found {len(response)} entry')
print(response[0]["response"])
found 1 entry
Your most recent transaction was for $350
# Cleanup
complex_cache.delete()