Semantic caching
Semantic caching with RedisVL
Before you begin, make sure that:
- You have installed RedisVL and have that environment activated.
- You have a running Redis instance with search and query capabilities (a quick connection check is sketched after this list).
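If you want to verify the second prerequisite before continuing, here is a minimal sanity check. It assumes the redis-py client (installed alongside RedisVL) and the same local connection URL used later in this document:

```python
# Optional sanity check: confirm the Redis server is reachable and has the
# search module loaded. The connection URL is an assumption; adjust as needed.
from redis import Redis

r = Redis.from_url("redis://127.0.0.1:6379")
print(r.ping())           # True if the server is reachable
print(r.module_list())    # the search module should appear in this list
```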
Semantic caching for LLMs
RedisVL provides a SemanticCache interface that uses Redis's built-in caching capabilities and vector search to store responses to previously answered questions. This reduces the number of requests and tokens sent to the LLM service, lowering costs and improving application throughput by shortening the time it takes to generate responses.
This document will teach you how to use Redis as a semantic cache for your applications.
Begin by importing OpenAI so that you can use its API to respond to user prompts. You will also create a simple ask_openai helper method to assist.
import os
import getpass
import time
from openai import OpenAI
import numpy as np
os.environ["TOKENIZERS_PARALLELISM"] = "False"
api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
client = OpenAI(api_key=api_key)
def ask_openai(question: str) -> str:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=question,
        max_tokens=200
    )
    return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))
The capital of France is Paris.
Initialize SemanticCache
Upon initialization, SemanticCache automatically creates an index within Redis for the semantic cache content.
from redisvl.extensions.llmcache import SemanticCache
llmcache = SemanticCache(
    name="llmcache",                        # underlying search index name
    prefix="llmcache",                      # redis key prefix for hash entries
    redis_url="redis://127.0.0.1:6379",     # redis connection url string
    distance_threshold=0.1                  # semantic cache distance threshold
)
# look at the index specification created for the semantic cache lookup
$ rvl index info -i llmcache
Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name │ Storage Type │ Prefixes │ Index Options │ Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ llmcache │ HASH │ ['llmcache'] │ [] │ 0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬────────┬────────────────┬────────────────╮
│ Name │ Attribute │ Type │ Field Option │ Option Value │
├───────────────┼───────────────┼────────┼────────────────┼────────────────┤
│ prompt │ prompt │ TEXT │ WEIGHT │ 1 │
│ response │ response │ TEXT │ WEIGHT │ 1 │
│ prompt_vector │ prompt_vector │ VECTOR │ │ │
╰───────────────┴───────────────┴────────┴────────────────┴────────────────╯
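If you prefer to stay in Python rather than shelling out to the rvl CLI, a rough equivalent is sketched below. It assumes the cache exposes its underlying search index as llmcache.index and that the info() dictionary uses the key names shown, as in recent RedisVL releases; check your installed version.

```python
# Inspect the index that backs the semantic cache from Python.
# The llmcache.index attribute and the info() keys are assumptions (see above).
info = llmcache.index.info()
print(info["index_name"], info["num_docs"])
```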
Basic cache usage
question = "What is the capital of France?"
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
print(response)
else:
print("Empty cache")
Empty cache
Your initial cache check should be empty because you have not yet stored anything in the cache. Below, store the question, the proper response, and any arbitrary metadata (as a Python dictionary object) in the cache.
# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Paris",
    metadata={"city": "Paris", "country": "france"}
)
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
print(response)
else:
print("Empty cache")
[{'id': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545', 'vector_distance': '9.53674316406e-07', 'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}}]
# Check for a semantically similar result
question = "What actually is the capital of France?"
llmcache.check(prompt=question)[0]['response']
'Paris'
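A cache hit also carries the vector distance of the match (the vector_distance field seen in the earlier output), which is handy when tuning the threshold discussed next. A small sketch:

```python
# The cache hit includes the vector distance of the match alongside the response.
hit = llmcache.check(prompt="What actually is the capital of France?")[0]
print(hit["vector_distance"], hit["response"])
```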
Customize the distance threshold
For most use cases, the right semantic similarity threshold is not a fixed value. Depending on the embedding model you choose, the properties of the input queries, and your business use case, the threshold might need to change.
Fortunately, you can seamlessly adjust the threshold at any time, as shown below.
# Widen the semantic distance threshold
llmcache.set_threshold(0.3)
# Really try to trick it by asking around the point
# But is able to slip just under our new threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
llmcache.check(prompt=question)[0]['response']
'Paris'
# Invalidate the cache completely by clearing it out
llmcache.clear()
# should be empty now
llmcache.check(prompt=question)
[]
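Once you are done experimenting, you can read back the current threshold and tighten it again. A minimal sketch, assuming the distance_threshold property exposed by SemanticCache:

```python
# Read the current threshold and restore the stricter value used earlier.
print(llmcache.distance_threshold)  # 0.3 after the widening above
llmcache.set_threshold(0.1)
```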
Utilize TTL
Redis uses optional time-to-live (TTL) policies to expire individual keys at some point in the future. This lets you focus on your data flow and business logic without worrying about complex cleanup tasks.
A TTL policy set on the SemanticCache allows you to retain cache entries temporarily. Set the TTL policy to 5 seconds.
llmcache.set_ttl(5) # 5 seconds
llmcache.store("This is a TTL test", "This is a TTL test response")
time.sleep(5)
# confirm that the cache has cleared by now on its own
result = llmcache.check("This is a TTL test")
print(result)
[]
# Reset the TTL to null (long lived data)
llmcache.set_ttl()
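If you know up front that every entry should be short-lived, the TTL can also be supplied when the cache is constructed. A sketch, assuming the ttl keyword argument accepted by SemanticCache in recent RedisVL releases; the cache name here is purely illustrative:

```python
# Hypothetical cache whose entries expire automatically after 5 seconds.
ephemeral_cache = SemanticCache(
    name="ephemeral_llmcache",          # illustrative index name
    redis_url="redis://127.0.0.1:6379",
    distance_threshold=0.1,
    ttl=5,                              # seconds; assumed constructor argument
)
```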
Simple performance testing
Next, you will measure the speedup you get from using the SemanticCache. You will use the time module to measure how long it takes to generate responses with and without the SemanticCache.
def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI with a wrapper
    check for the answer in the semantic cache first.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer
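Note that answer_question only reads from the cache; in a real application you would usually also write fresh answers back on a miss, so later semantically similar questions are served from Redis. A minimal sketch of that variant (the function name is illustrative):

```python
def answer_question_and_cache(question: str) -> str:
    """Like answer_question, but also stores fresh answers on a cache miss."""
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    answer = ask_openai(question)
    llmcache.store(prompt=question, response=answer)
    return answer
```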
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()
print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")
Without caching, a call to openAI to answer this simple question took 0.5017588138580322 seconds.
llmcache.store(prompt=question, response="George Washington")
# Calculate the avg latency for caching over LLM usage
times = []
for _ in range(10):
cached_start = time.time()
cached_answer = answer_question(question)
cached_end = time.time()
times.append(cached_end-cached_start)
avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")
Avg time taken with LLM cache enabled: 0.2560166358947754
Percentage of time saved: 82.47%
```bash
# check the stats of the index
$ rvl stats -i llmcache
Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key │ Value │
├─────────────────────────────┼─────────────┤
│ num_docs │ 1 │
│ num_terms │ 19 │
│ max_doc_id │ 3 │
│ num_records │ 23 │
│ percent_indexed │ 1 │
│ hash_indexing_failures │ 0 │
│ number_of_uses │ 19 │
│ bytes_per_record_avg │ 5.30435 │
│ doc_table_size_mb │ 0.000134468 │
│ inverted_sz_mb │ 0.000116348 │
│ key_table_size_mb │ 2.76566e-05 │
│ offset_bits_per_record_avg │ 8 │
│ offset_vectors_sz_mb │ 2.09808e-05 │
│ offsets_per_term_avg │ 0.956522 │
│ records_per_doc_avg │ 23 │
│ sortable_values_size_mb │ 0 │
│ total_indexing_time │ 1.211 │
│ total_inverted_index_blocks │ 19 │
│ vector_index_sz_mb │ 3.0161 │
╰─────────────────────────────┴─────────────╯
```
# Clear the cache AND delete the underlying index
llmcache.delete()