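Building RAG from Scratch: Build Your First RAG Pipeline with RedisVL

Have you ever felt that mix of excitement and uncertainty when starting something new? That is exactly how I felt when I set out to build a retrieval-augmented generation (RAG) pipeline with the Redis vector library.

RAG may sound like just another buzzword, but it is really about combining semantic search with large language models to create smarter, more efficient ways to find and use information. For someone like me who is always looking for ways to sharpen technical skills, it was the perfect challenge.

What followed was a journey of ups and downs, with successes, setbacks, and a few "aha" moments. From understanding how to preprocess data for semantic search to learning about vector embeddings and schema design, this project was as much about exploring as it was about building.

This post is my way of sharing the whole experience: the wins, the challenges, and everything in between. If you are interested in Redis, RedisVL, or RAG, or are simply curious about tackling a technical project, this post is for you.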

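Who am I? A PMM who speaks developer

I'm Rini, and my path through tech has been anything but linear. I started out as a backend software engineer, buried in code and solving hard technical problems. Over time, I found myself drawn to a different kind of problem solving: understanding what developers need and bringing technical products to life in a way that truly resonates with users. That is how I ended up as an AI Product Marketing Manager (PMM) at Redis.

Although I've traded daily coding for marketing strategy, my curiosity about the technical side hasn't faded. As a PMM, knowing our products deeply is key to helping users get the most out of them. That is why I rolled up my sleeves and built my first RAG pipeline.

The Redis vector library felt like a great starting point. It sits at the forefront of intelligent search and AI-powered applications, and it gave me a chance to explore RAG, a technology I had been hearing a lot about.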

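What was my goal?

The main focus of my project was to build a RAG pipeline from scratch using the Redis vector library. I followed this Redis tutorial to start learning RedisVL. RAG is an exciting technique that combines semantic search with large language models (LLMs) to retrieve relevant information and generate accurate, context-aware answers. By tackling this project, I aimed to understand the fundamentals of RAG and how Redis supports this kind of application.

The Redis vector library (RedisVL) was the essential tool for this project. It simplifies working with vector embeddings and enables fast, accurate semantic search, which is critical for a functional RAG pipeline. RedisVL makes it easy to store, search, and retrieve the data that matters.

I ended up with a working AI assistant that can answer questions about a recent Nike earnings call. It pulls relevant context from Nike's financial report and uses an LLM to generate accurate, context-aware answers.

I learned how easy it is to set up an AI assistant with RAG and Redis. Beyond the technical implementation, the project also highlighted how tools like Redis and RAG can make an impact in the real world. Extend it to industries such as finance, healthcare, and education, and you get AI assistants that deliver instant insights and make critical information actionable.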

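First, I did the basic setup

To follow along, you can use this tutorial on our AI resources developer hub.

Set up your environment

Clone the required GitHub repository to access the datasets and resources.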

!git clone https://github.com/redis-developer/redis-ai-resources.git temp_repo 
!mv temp_repo/python-recipes/RAG/resources . 
!rm -rf temp_repo

Install the Python dependencies, including redis, redisvl, and LangChain.

!pip install -q redis redisvl langchain_community pypdf sentence-transformers langchain openai

Install and configure Redis

Set up a local Redis Stack instance to store, index, and query vector embeddings.

%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

Configure the Redis connection URL so it works with either a local or a cloud instance.

import os
import warnings
#warnings.filterwarnings('ignore')

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
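The comment above about `rediss://` can be folded into a small helper. This is only a sketch: the function name `build_redis_url` and its `use_ssl` flag are my own illustration, not part of the tutorial. It also percent-encodes the password, on the assumption that the client decodes it again when parsing the URL.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", use_ssl: bool = False) -> str:
    """Build a Redis connection URL (hypothetical helper, not from the tutorial)."""
    # rediss:// (with a double "s") selects a TLS connection.
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so characters like "@" or "/" do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


local_url = build_redis_url("localhost", "6379")
cloud_url = build_redis_url("example-host", "18374", password="p@ss", use_ssl=True)
print(local_url)
print(cloud_url)
```

The host name and password here are placeholders; substitute the values from your own instance.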

Prepare the dataset

Load a financial 10-K PDF document using LangChain's PyPDFLoader.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load list of pdfs from a folder
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# pick out the Nike doc for this exercise
doc = [doc for doc in docs if "nke" in doc][0]

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=2500, chunk_overlap=0
)
loader = PyPDFLoader(doc, headers = None)
# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

This is the output you should see in Colab.
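To get a feel for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified splitter. It is not how RecursiveCharacterTextSplitter works internally (the real splitter tries to break on separators such as paragraph and line breaks first); the helper `split_fixed` is hypothetical and only cuts fixed-size character windows.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of stand-in text
chunks_no_overlap = split_fixed(sample, chunk_size=12)
chunks_overlap = split_fixed(sample, chunk_size=12, chunk_overlap=4)
print([len(c) for c in chunks_no_overlap])
print([len(c) for c in chunks_overlap])
```

With an overlap of 0, as in the tutorial, every character lands in exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary.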

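RecursiveCharacterTextSplitter preprocesses the document and splits it into manageable chunks. Next, generate an embedding for each chunk with HFTextVectorizer.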

from redisvl.utils.vectorize import HFTextVectorizer
import pandas as pd
from tqdm.auto import tqdm

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embed each chunk content
embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(embeddings) == len(chunks)
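Once every chunk is a vector, "semantic similarity" becomes a number you can compute. The sketch below shows cosine distance (1 minus cosine similarity) in plain Python with toy two-dimensional vectors. The helper name is mine, and real embeddings from this model have 384 dimensions.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite directions
print(same, orthogonal, opposite)
```

A smaller distance means the two texts point in a more similar direction in embedding space. Vector length does not matter, only direction.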

Define the schema and create the index

Design a Redis index schema with text, tag, and vector embedding fields.

from redis import Redis
from redisvl.index import SearchIndex

index_name = "redisvl"

schema = {
 "index": {
   "name": index_name,
   "prefix": "chunk"
 },
 "fields": [
   {
       "name": "chunk_id",
       "type": "tag",
       "attrs": {
           "sortable": True
       }
   },
   {
       "name": "content",
       "type": "text"
   },
   {
       "name": "text_embedding",
       "type": "vector",
       "attrs": {
           "dims": 384,
           "distance_metric": "cosine",
           "algorithm": "hnsw",
           "datatype": "float32"
       }
   }
 ]
}
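The `dims` value has to match the length of the vectors the embedding model produces (384 for all-MiniLM-L6-v2). A quick sanity check before loading can catch a mismatch early. This is a sketch with a hypothetical helper, `check_dims`, and a trimmed-down copy of the schema so that it runs on its own.

```python
def check_dims(schema: dict, field_name: str, vector: list[float]) -> None:
    """Raise ValueError if the vector length differs from the schema's dims (hypothetical helper)."""
    for field in schema["fields"]:
        if field["name"] == field_name:
            expected = field["attrs"]["dims"]
            if len(vector) != expected:
                raise ValueError(f"{field_name}: expected {expected} dims, got {len(vector)}")
            return
    raise KeyError(f"no field named {field_name!r} in schema")


# Trimmed-down stand-in for the schema above, containing only the vector field.
demo_schema = {
    "fields": [
        {"name": "text_embedding", "type": "vector", "attrs": {"dims": 384}}
    ]
}

check_dims(demo_schema, "text_embedding", [0.0] * 384)  # matching length: no error

try:
    check_dims(demo_schema, "text_embedding", [0.0] * 768)  # e.g. a larger model
except ValueError as err:
    mismatch = str(err)
    print(mismatch)
```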

Create the index in Redis so that semantic search works efficiently.

# connect to redis
client = Redis.from_url(REDIS_URL)
# create an index from schema and the client
index = SearchIndex.from_dict(schema)
index.set_client(client)
index.create(overwrite=True, drop=True)

# use the RedisVL CLI tool to list all indices
!rvl index listall
# get info about the index
!rvl index info -i redisvl

Load the data into Redis

Take the preprocessed chunks and their embeddings and load them into the Redis index.

# load expects an iterable of dictionaries
from redisvl.redis.utils import array_to_buffer

data = [
   {
       'chunk_id': i,
       'content': chunk.page_content,
       # For HASH -- must convert embeddings to bytes
       'text_embedding': array_to_buffer(embeddings[i], dtype='float32')
   } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data, id_field="chunk_id")
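The "must convert embeddings to bytes" comment is worth unpacking: in a Redis hash, the vector field holds the raw float32 values packed back to back. The sketch below shows the idea with the standard library only. The helper names are hypothetical, and I am assuming `array_to_buffer` produces an equivalent float32 buffer via NumPy.

```python
import struct


def to_float32_bytes(vector: list[float]) -> bytes:
    """Pack floats as little-endian float32, 4 bytes per value (hypothetical helper)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(buffer: bytes) -> list[float]:
    """Unpack a little-endian float32 buffer back into Python floats."""
    count = len(buffer) // 4
    return list(struct.unpack(f"<{count}f", buffer))


vec = [0.5, -1.25, 3.0]  # values chosen to be exactly representable in float32
buf = to_float32_bytes(vec)
print(len(buf))
print(from_float32_bytes(buf))
```

By the same arithmetic, a 384-dimensional float32 embedding occupies 384 × 4 = 1536 bytes per chunk.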

Query the database

Build a vector query to find the text chunks that are semantically similar to the user's query.

from redisvl.query import VectorQuery
query = "Nike profit margins and company performance"
query_embedding = hf.embed(query)
vector_query = VectorQuery(
   vector=query_embedding,
   vector_field_name="text_embedding",
   num_results=3,
   return_fields=["chunk_id", "content"],
   return_score=True
)

# show the raw redis query
str(vector_query)

# execute the query with RedisVL
result=index.query(vector_query)

# view the results
pd.DataFrame(result)

These are the results shown in the Colab notebook.
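Conceptually, a vector query asks for the `num_results` stored vectors closest to the query vector. The brute-force sketch below shows that idea on toy data. It is not what Redis does internally: the index above uses HNSW, an approximate method that avoids scanning every vector. The function names are mine.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (doc_id, cosine_distance) pairs closest to the query, nearest first."""
    q = normalize(query)
    scored = []
    for doc_id, vec in docs.items():
        similarity = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((1.0 - similarity, doc_id))
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(k, scored)]


toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.1], toy_docs, k=2)
print(hits)
```

The query points almost exactly along "a", so "a" comes first, followed by "c", which sits halfway between the two axes.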

# paginate through results
for result in index.paginate(vector_query, page_size=1):
   print(result[0]["chunk_id"], result[0]["vector_distance"], flush=True)

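This is what paginating through results looks like in the Colab notebook.

Run similarity searches, extract the relevant results, and explore additional filtering and sorting options.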

# Sort by chunk_id field after vector search limits to topK
vector_query = VectorQuery(
   vector=query_embedding,
   vector_field_name="text_embedding",
   num_results=4,
   return_fields=["chunk_id"],
   return_score=True
)

# Decompose vector_query into the core query and the params
query = vector_query.query
params = vector_query.params

# Pass query and params direct to index.search()
result = index.search(
   query.sort_by("chunk_id", asc=True),
   params
)

pd.DataFrame([doc.__dict__ for doc in result.docs])

from redisvl.query.filter import Text

vector_query = VectorQuery(
   vector=query_embedding,
   vector_field_name="text_embedding",
   num_results=4,
   return_fields=["content"],
   return_score=True
)

# Set a text filter
text_filter = Text("content") % "profit"

vector_query.set_filter(text_filter)

result=index.query(vector_query)
pd.DataFrame(result)

These are the query results in the Colab notebook.

from redisvl.query import RangeQuery
range_query = RangeQuery(
   vector=query_embedding,
   vector_field_name="text_embedding",
   num_results=4,
   return_fields=["content"],
   return_score=True,
   distance_threshold=0.8  # find all items with a semantic distance of less than 0.8
)

result=index.query(range_query)
pd.DataFrame(result)

# Add filter to range query
range_query.set_filter(text_filter)

# re-run the query so the DataFrame below shows the filtered results
result = index.query(range_query)
pd.DataFrame(result)

These are the range query results in the Colab notebook.
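The difference between a top-k query and a range query is easy to see on already-scored rows: a range query keeps everything within a distance threshold instead of a fixed count. Here is a sketch over made-up rows. The helper `within_range` is hypothetical, and its plain substring match is only a rough stand-in for Redis full-text matching, which tokenizes the text. Treating a row exactly at the threshold as a match is also my assumption.

```python
from typing import Optional


def within_range(rows: list[dict], distance_threshold: float,
                 keyword: Optional[str] = None) -> list[dict]:
    """Keep rows inside the distance threshold, optionally requiring a keyword (hypothetical helper)."""
    kept = []
    for row in rows:
        if row["vector_distance"] > distance_threshold:
            continue  # too far from the query
        if keyword is not None and keyword.lower() not in row["content"].lower():
            continue  # fails the text filter
        kept.append(row)
    # Nearest rows first, like the query results above.
    return sorted(kept, key=lambda row: row["vector_distance"])


# Made-up rows for illustration; these are not results from the Nike document.
demo_rows = [
    {"content": "Gross profit rose this year", "vector_distance": 0.35},
    {"content": "Store openings by region", "vector_distance": 0.62},
    {"content": "Profit outlook and risks", "vector_distance": 0.91},
]

in_range = within_range(demo_rows, distance_threshold=0.8)
in_range_with_profit = within_range(demo_rows, distance_threshold=0.8, keyword="profit")
print(len(in_range), len(in_range_with_profit))
```

Note that the third row mentions profit but is dropped either way, because it falls outside the distance threshold.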

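Build the RAG pipeline

Set up the RedisVL AsyncSearchIndex. It is a tool for creating and managing search indexes in an async environment, with non-blocking operations for high-concurrency applications. It lets you define a data schema, load and query data, and run vector searches efficiently, which makes it a good fit for scalable AI workflows such as RAG pipelines.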

from redis.asyncio import Redis as AsyncRedis
from redisvl.index import AsyncSearchIndex

client = AsyncRedis.from_url(REDIS_URL)
async_index = AsyncSearchIndex.from_dict(schema)
await async_index.set_client(client)

Integrate an OpenAI GPT model (gpt-3.5-turbo-0125) to generate context-aware answers from the retrieved results.

import openai
import os
import getpass
CHAT_MODEL = "gpt-3.5-turbo-0125"

if "OPENAI_API_KEY" not in os.environ:
   os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY :")

Use a structured prompt that combines the user's question with the relevant document context to get the best answer.

async def answer_question(index: AsyncSearchIndex, query: str):
   """Answer the user's question"""

   SYSTEM_PROMPT = """You are a helpful financial analyst assistant that has access
   to public financial 10k documents in order to answer users questions about company
   performance, ethics, characteristics, and core information.
   """

   query_vector = hf.embed(query)
   # Fetch context from Redis using vector search
   context = await retrieve_context(index, query_vector)
   # Generate contextualized prompt and feed to OpenAI
   response = await openai.AsyncClient().chat.completions.create(
       model=CHAT_MODEL,
       messages=[
           {"role": "system", "content": SYSTEM_PROMPT},
           {"role": "user", "content": promptify(query, context)}
       ],
       temperature=0.1,
       seed=42
   )
   # Response provided by LLM
   return response.choices[0].message.content


async def retrieve_context(async_index: AsyncSearchIndex, query_vector) -> str:
   """Fetch the relevant context from Redis using vector search"""
   results = await async_index.query(
       VectorQuery(
           vector=query_vector,
           vector_field_name="text_embedding",
           return_fields=["content"],
           num_results=3
       )
   )
   content = "\n".join([result["content"] for result in results])
   return content


def promptify(query: str, context: str) -> str:
   return f'''Use the provided context below derived from public financial
   documents to answer the user's question. If you can't answer the user's
   question, based on the context; do not guess. If there is no context at all,
   respond with "I don't know".

   User question:

   {query}

   Helpful context:

   {context}

   Answer:
   '''

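Now that everything is set up, we can ask questions about the financial report.

Test the pipeline

Ask financial questions (for example, about revenue trends or ESG practices) to test the RAG pipeline.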

# Generate a list of questions
questions = [
   "What is the trend in the company's revenue and profit over the past few years?",
   "What are the company's primary revenue sources?",
   "How much debt does the company have, and what are its capital expenditure plans?",
   "What does the company say about its environmental, social, and governance (ESG) practices?",
   "What is the company's strategy for growth?"
]

import asyncio

results = await asyncio.gather(*[
   answer_question(async_index, question) for question in questions
])
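One detail that makes this pattern convenient: `asyncio.gather` returns results in the order the awaitables were passed in, not the order they finish, so `results[i]` lines up with `questions[i]`. The sketch below demonstrates this with a stand-in coroutine, `fake_answer`, which is hypothetical and just sleeps instead of calling Redis or OpenAI.

```python
import asyncio


async def fake_answer(question: str, delay: float) -> str:
    """Stand-in for answer_question: sleeps instead of calling Redis or OpenAI."""
    await asyncio.sleep(delay)
    return f"answer to: {question}"


async def main() -> list[str]:
    demo_questions = ["first", "second", "third"]
    delays = [0.03, 0.01, 0.02]  # the second task finishes first
    return await asyncio.gather(*[
        fake_answer(q, d) for q, d in zip(demo_questions, delays)
    ])


# In a plain script use asyncio.run; in a notebook cell, `await main()` instead.
demo_results = asyncio.run(main())
print(demo_results)
```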

Review the results

Retrieve accurate, context-grounded answers that demonstrate the pipeline's effectiveness.

for i, r in enumerate(results):
   print(f"Question: {questions[i]}")
   print(f"Answer: \n {r}", "\n-----------\n")

Here, finally, we can see the answers to some of the questions we asked in the Colab notebook.


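Highlights of my journey

Many parts of this project were surprisingly intuitive and enjoyable. Exploring the PyPDFLoader and RecursiveCharacterTextSplitter documentation was a highlight, because it showed me how easy it is to preprocess a PDF and structure its text into meaningful chunks.

RedisVL stood out for its simplicity and efficiency. Tasks like generating text embeddings with HFTextVectorizer and integrating a Hugging Face model felt seamless. RedisVL's vector search support made it easy to store, index, and retrieve relevant data, which is essential for building a RAG pipeline.

Defining the schema, loading data into Redis, and querying the database also went smoothly. Watching all the pieces work together was satisfying and showed how powerful and user-friendly these tools are. The whole process was both rewarding and fun.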

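The bumps in the road

I also hit a few bumps along the way and had to learn some new things to finish my AI assistant. First, I had to get up to speed quickly on concepts such as Hugging Face models, vector embeddings, and semantic search. These were all new to me, so it took extra reading and experimentation to fully understand how they work together.

Another obstacle was hitting OpenAI API rate limits while testing the RAG pipeline. Every query to the API returned a "quota exceeded" error, which effectively blocked progress. To work around it, I considered adding a delay between API requests to stay under the rate limit. In the end, I chose a different fix: switching to another OpenAI API key that belonged to a business plan. That let me continue testing and complete the pipeline.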
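For reference, the delay-based approach I considered might look something like the sketch below: retry the call with exponentially growing pauses. Everything here is illustrative. `RateLimitedError` is a hypothetical stand-in for whatever rate-limit exception your API client raises, and `flaky_call` simulates an API that fails twice before succeeding.

```python
import asyncio
import random


class RateLimitedError(Exception):
    """Hypothetical stand-in for the rate-limit error an API client raises."""


async def with_backoff(make_call, max_retries: int = 5, base_delay: float = 1.0):
    """Await make_call(), retrying with exponential backoff on RateLimitedError."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimitedError:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            # Double the wait on each attempt and add jitter so callers spread out.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


attempts = {"count": 0}


async def flaky_call() -> str:
    """Simulated API call that is rate limited on its first two attempts."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitedError("rate limited")
    return "ok"


# In a notebook cell, use `await with_backoff(...)` instead of asyncio.run.
outcome = asyncio.run(with_backoff(flaky_call, base_delay=0.01))
print(outcome, attempts["count"])
```

Backoff only helps with temporary rate limits. If the account has actually run out of quota, waiting will not change the response, which is likely why switching keys was what unblocked me.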

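Key takeaways

Completing this project and building a RAG pipeline with RedisVL taught me a lot. RedisVL stood out as a powerful tool that makes vector search and embedding management simple and efficient. Its seamless integration with Hugging Face models highlights how well it supports complex AI workflows, and it proved invaluable in providing semantic search for the RAG pipeline. The project also deepened my understanding of RAG pipelines and how they combine semantic search with large language models to deliver precise, context-aware answers.

Another important lesson was the value of effective data preprocessing. Tools like PyPDFLoader and RecursiveCharacterTextSplitter made structuring the data straightforward and kept the rest of the pipeline running smoothly.

Ultimately, this hands-on exploration reinforced how important it is, even as a PMM, to approach tools like Redis with a developer's mindset. It was a rewarding journey that showed the potential of Redis and RAG pipelines and left me eager to try more advanced use cases.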

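Next steps

If you are as curious as I am about what Redis and RAG pipelines can do, the best way to get started is to try it yourself. Hands-on experience is invaluable. To follow my steps, you can use this Colab notebook to build the same RAG pipeline I built.

For more in-depth guidance, be sure to check out the RedisVL documentation. It is a rich resource with detailed guides and examples that can help you understand everything RedisVL offers.

You might also consider using RedisVL to enhance a current project or spark new ideas. From powering search engines to tuning recommendation systems, the range of applications is wide. RedisVL can help you solve real-world challenges in your work.

Finally, if you are looking for more inspiration, check out the Redis for AI documentation. It offers additional use cases, tutorials, and hands-on projects to help you explore other ways to use RedisVL.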