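Vectorizers
Supported vectorizers
In this document, you will learn how to use RedisVL to create embeddings with its built-in text embedding vectorizers. RedisVL supports: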
- OpenAI
- HuggingFace
- Vertex AI
- Cohere
Before you begin, make sure that:
- You have installed RedisVL and activated its environment.
- You have a running Redis instance with search and query capability.
# import necessary modules
import os
Creating text embeddings
This example shows how to create embeddings from three simple sentences using several different text vectorizers in RedisVL.
- "That is a happy dog"
- "That is a happy person"
- "Today is a sunny day"
OpenAI
The OpenAITextVectorizer makes it easy to use RedisVL with the embedding models from OpenAI. To use it, you need to install the openai package.
pip install openai
import getpass
# setup the API Key
api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
from redisvl.utils.vectorize import OpenAITextVectorizer
# create a vectorizer
oai = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": api_key},
)
test = oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
Vector dimensions: 1536
[-0.001025049015879631,
-0.0030993607360869646,
0.0024536605924367905,
-0.004484387580305338,
-0.010331203229725361,
0.012700922787189484,
-0.005368996877223253,
-0.0029411641880869865,
-0.0070833307690918446,
-0.03386051580309868]
# Create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = oai.embed_many(sentences)
embeddings[0][:10]
[-0.01747742109000683,
-5.228330701356754e-05,
0.0013870716793462634,
-0.025637786835432053,
-0.01985435001552105,
0.016117358580231667,
-0.0037306349258869886,
0.0008945261361077428,
0.006577865686267614,
-0.025091219693422318]
# openai also supports asynchronous requests, which you can use to speed up the vectorization process.
embeddings = await oai.aembed_many(sentences)
print("Number of Embeddings:", len(embeddings))
Number of Embeddings: 3
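The bare `await` above works in a notebook cell, but a plain script needs an event loop. The following is a minimal sketch of that pattern, assuming you want to send several batches concurrently. `fake_aembed_many` is a hypothetical stand-in for a network-bound call such as `oai.aembed_many`, and `embed_in_batches` is an illustrative helper, not part of RedisVL.

```python
import asyncio


async def fake_aembed_many(texts):
    # Hypothetical stand-in for a network-bound call such as oai.aembed_many.
    # It returns one dummy 1-dimensional "embedding" per input text.
    await asyncio.sleep(0.01)
    return [[float(len(t))] for t in texts]


async def embed_in_batches(texts, batch_size=2):
    # Split the inputs into batches and issue all batch requests concurrently.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    # asyncio.gather preserves the order of the awaitables it is given.
    results = await asyncio.gather(*(fake_aembed_many(b) for b in batches))
    # Flatten the per-batch results back into one list of embeddings.
    return [vec for batch in results for vec in batch]


sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]

# asyncio.run creates the event loop that a notebook provides implicitly.
embeddings = asyncio.run(embed_in_batches(sentences))
print("Number of Embeddings:", len(embeddings))
```

With a real vectorizer, the stub would be replaced by the vectorizer's own asynchronous method. Whether batching actually helps depends on the provider's rate limits.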
Huggingface
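Huggingface is a popular natural language processing (NLP) platform that hosts many pre-trained models you can use off the shelf. RedisVL supports using Huggingface "Sentence Transformers" to create embeddings from text. To use Huggingface, you need to install the sentence-transformers library.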
pip install sentence-transformers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer
# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")
# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]
[0.00037810884532518685,
-0.05080341175198555,
-0.03514723479747772,
-0.02325104922056198,
-0.044158220291137695,
0.020487844944000244,
0.0014617963461205363,
0.031261757016181946,
0.05605152249336243,
0.018815357238054276]
# You can also create many embeddings at once
embeddings = hf.embed_many(sentences, as_buffer=True)
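To make the `as_buffer=True` option more concrete, here is a small sketch of what a vector stored as a bytes buffer could look like. It assumes the buffer is a packed array of 32-bit floats in native byte order. The helper names `to_float32_buffer` and `from_float32_buffer` are hypothetical and are not part of RedisVL.

```python
from array import array


def to_float32_buffer(vector):
    # Pack a list of Python floats into raw bytes, 4 bytes per value.
    return array("f", vector).tobytes()


def from_float32_buffer(buffer):
    # Unpack raw float32 bytes back into a list of Python floats.
    values = array("f")
    values.frombytes(buffer)
    return values.tolist()


# These values are exactly representable in float32, so the round trip is lossless.
vector = [0.5, -1.25, 2.0]
buffer = to_float32_buffer(vector)

print(len(buffer))                  # 3 values * 4 bytes each
print(from_float32_buffer(buffer))  # the original values
```

Under this assumption, a 768-dimensional embedding occupies 3072 bytes, which is a quick sanity check to run before loading data into an index.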
VertexAI
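VertexAI is GCP's fully featured AI platform, which includes a number of pre-trained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you first need to install the google-cloud-aiplatform library.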
pip install "google-cloud-aiplatform>=1.26"
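Next, you need access to a Google Cloud project and credentials for it. You can provide these by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file for a service account, downloaded from GCP.
Finally, you need to find your project ID and the geographic region for VertexAI.
Make sure the following environment variables are set: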
GOOGLE_APPLICATION_CREDENTIALS=<path to your gcp JSON creds>
GCP_PROJECT_ID=<your gcp project id>
GCP_LOCATION=<your gcp geo region for vertex ai>
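A quick pre-flight check can report which of these variables are missing before the vectorizer is constructed. This is only a sketch. `missing_env_vars` is a hypothetical helper, not a RedisVL function.

```python
import os

REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "GCP_LOCATION",
)


def missing_env_vars(names, env=None):
    # Return the names that are unset or empty in the given environment mapping.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All VertexAI environment variables are set.")
```

The check only confirms that the variables exist. It does not validate the credentials file or the project ID.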
from redisvl.utils.vectorize import VertexAITextVectorizer
# create a vectorizer
vtx = VertexAITextVectorizer(api_config={
    "project_id": os.environ.get("GCP_PROJECT_ID") or getpass.getpass("Enter your GCP Project ID: "),
    "location": os.environ.get("GCP_LOCATION") or getpass.getpass("Enter your GCP Location: "),
    "google_application_credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or getpass.getpass("Enter your Google App Credentials path: ")
})
# embed a sentence
test = vtx.embed("This is a test sentence.")
test[:10]
[0.04373306408524513,
-0.05040992051362991,
-0.011946038343012333,
-0.043528858572244644,
0.021510830149054527,
0.028604144230484962,
0.014770914800465107,
-0.01610461436212063,
-0.0036560404114425182,
0.013746795244514942]
Cohere
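Cohere lets you build language AI into your product. The CohereTextVectorizer makes it simple to use RedisVL with the embedding models from Cohere. To use it, you need to install the cohere package.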
pip install cohere
import getpass
# set up the API Key
api_key = os.environ.get("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")
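Pay special attention to the input_type parameter of each embed call. For example, to embed a query, set input_type='search_query'. To embed documents, set input_type='search_document'. See the Cohere documentation for more information.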
from redisvl.utils.vectorize import CohereTextVectorizer
# create a vectorizer
co = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": api_key},
)
# embed a search query
test = co.embed("This is a test sentence.", input_type='search_query')
print("Vector dimensions: ", len(test))
print(test[:10])
# embed a document
test = co.embed("This is a test sentence.", input_type='search_document')
print("Vector dimensions: ", len(test))
print(test[:10])
Vector dimensions: 1024
[-0.010856628, -0.019683838, -0.0062179565, 0.003545761, -0.047943115, 0.0009365082, -0.005924225, 0.016174316, -0.03289795, 0.049194336]
Vector dimensions: 1024
[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834]
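Learn more about using RedisVL with Cohere in the dedicated user guide.
Search with provider embeddings
Now that you have created your embeddings, you can use them to search for similar sentences. You will use the same three sentences from above and search for a similar sentence.
First, create the schema for your index.
Here is what the schema looks like in YAML for the HuggingFace vectorizer: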
version: '0.1.0'

index:
    name: vectorizers
    prefix: doc
    storage_type: hash

fields:
    - name: sentence
      type: text
    - name: embedding
      type: vector
      attrs:
        dims: 768
        algorithm: flat
        distance_metric: cosine
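The dims value has to match the length of the vectors your chosen vectorizer produces: 768 for the HuggingFace model here, while the OpenAI and Cohere examples above printed 1536 and 1024. The sketch below renders the same schema for a different dimension. `SCHEMA_TEMPLATE`, `MODEL_DIMS`, and `render_schema` are hypothetical names introduced for illustration, and the dimension table is taken only from the outputs shown in this guide.

```python
SCHEMA_TEMPLATE = """\
version: '0.1.0'

index:
    name: vectorizers
    prefix: doc
    storage_type: hash

fields:
    - name: sentence
      type: text
    - name: embedding
      type: vector
      attrs:
        dims: {dims}
        algorithm: flat
        distance_metric: cosine
"""

# Vector sizes observed in this guide's example outputs.
MODEL_DIMS = {
    "text-embedding-ada-002": 1536,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "embed-english-v3.0": 1024,
}


def render_schema(dims):
    # Fill the vector dimension into the YAML schema text.
    return SCHEMA_TEMPLATE.format(dims=dims)


schema_text = render_schema(MODEL_DIMS["text-embedding-ada-002"])
print(schema_text)
```

Writing `schema_text` to `./schema.yaml` would let the `SearchIndex.from_yaml` call that follows pick it up. If you are unsure of a model's dimension, embedding one sentence and taking `len()` of the result, as the earlier examples do, is a reliable way to find out.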
from redisvl.index import SearchIndex
# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")
# connect to local redis instance
index.connect("redis://localhost:6379")
# create the index (no data yet)
index.create(overwrite=True)
# use the CLI to see the created index
!rvl index listall
22:02:27 [RedisVL] INFO Indices:
22:02:27 [RedisVL] INFO 1. vectorizers
# load expects an iterable of dictionaries where
# the vector is stored as a bytes buffer
data = [{"text": t,
         "embedding": v}
        for t, v in zip(sentences, embeddings)]
index.load(data)
['doc:17c401b679ce43cb82f3ab2280ad02f2',
'doc:3fc0502bec434b17a3f06e20824b2e59',
'doc:199f17b0e5d24dcaa1fd4fb41558150c']
from redisvl.query import VectorQuery
# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")
query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)
results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])
That is a happy dog 0.160862326622
That is a happy person 0.273598492146
Today is a sunny day 0.744559407234
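The numbers printed next to each sentence are distances, so smaller means more similar. As a rough illustration, and assuming the cosine metric reports one minus the cosine similarity, the toy example below computes that quantity in plain Python. `cosine_distance` is a hypothetical helper written for this sketch, and the two-dimensional vectors are made up.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors,
    # and 2 for opposite directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]

print(cosine_distance(query, [2.0, 0.0]))   # same direction
print(cosine_distance(query, [0.0, 3.0]))   # orthogonal
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction
```

Read this way, "That is a happy dog" is closest to the query "That is a happy cat", and the weather sentence is the farthest of the three.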
# cleanup
index.delete()