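Vectorizers

Supported vectorizers

In this document, you will learn how to use RedisVL to create embeddings with its built-in text embedding vectorizers. RedisVL supports: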

  1. OpenAI
  2. HuggingFace
  3. Vertex AI
  4. Cohere
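Note
This document was converted from a Jupyter notebook.

Before beginning, make sure that:

  1. You have installed RedisVL and activated that environment.
  2. You have a running Redis instance with search and query capability.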
# import necessary modules
import os

Creating text embeddings

This example shows how to create embeddings from three simple sentences using several different text vectorizers in RedisVL.

  • "That is a happy dog"
  • "That is a happy person"
  • "Today is a sunny day"

OpenAI

The OpenAITextVectorizer makes it simple to use RedisVL with the embedding models at OpenAI. To use it, you need to install openai:

pip install openai
import getpass

# setup the API Key
api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
from redisvl.utils.vectorize import OpenAITextVectorizer

# create a vectorizer
oai = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": api_key},
)

test = oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]

Vector dimensions:  1536

[-0.001025049015879631,
 -0.0030993607360869646,
 0.0024536605924367905,
 -0.004484387580305338,
 -0.010331203229725361,
 0.012700922787189484,
 -0.005368996877223253,
 -0.0029411641880869865,
 -0.0070833307690918446,
 -0.03386051580309868]
# Create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]

embeddings = oai.embed_many(sentences)
embeddings[0][:10]

[-0.01747742109000683,
 -5.228330701356754e-05,
 0.0013870716793462634,
 -0.025637786835432053,
 -0.01985435001552105,
 0.016117358580231667,
 -0.0037306349258869886,
 0.0008945261361077428,
 0.006577865686267614,
 -0.025091219693422318]
# OpenAI also supports asynchronous requests, which you can use to speed up the vectorization process.
embeddings = await oai.aembed_many(sentences)
print("Number of Embeddings:", len(embeddings))

Number of Embeddings: 3
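
The top-level await above works in a Jupyter notebook, which already runs an event loop. In a plain Python script the coroutine has to be driven explicitly. The following is a minimal sketch of that pattern. StubVectorizer is a hypothetical stand-in so the snippet runs without an API key; only the aembed_many method name is taken from the example above.

```python
import asyncio


class StubVectorizer:
    """Hypothetical stand-in for a vectorizer with an async batch method."""

    async def aembed_many(self, texts):
        # Pretend to call a remote API; return one tiny fake vector per text.
        await asyncio.sleep(0)
        return [[float(len(t)), 0.0, 0.0] for t in texts]


async def embed_all(vectorizer, texts):
    # Await the async batch call from inside a coroutine.
    return await vectorizer.aembed_many(texts)


def run_embedding(vectorizer, texts):
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(embed_all(vectorizer, texts))


sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]
embeddings = run_embedding(StubVectorizer(), sentences)
print("Number of Embeddings:", len(embeddings))
```

In a real script you would pass the vectorizer created earlier in place of the stub.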

Huggingface

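Huggingface is a popular natural language processing (NLP) platform that hosts many pre-trained models you can use out of the box. RedisVL supports using Huggingface "Sentence Transformers" to create embeddings from text. To use Huggingface, you need to install the sentence-transformers library.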

pip install sentence-transformers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer

# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]

[0.00037810884532518685,
 -0.05080341175198555,
 -0.03514723479747772,
 -0.02325104922056198,
 -0.044158220291137695,
 0.020487844944000244,
 0.0014617963461205363,
 0.031261757016181946,
 0.05605152249336243,
 0.018815357238054276]
# You can also create many embeddings at once
embeddings = hf.embed_many(sentences, as_buffer=True)
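
With as_buffer=True, each embedding comes back as a bytes buffer rather than a list of floats, which is the form the load step later in this document expects. As a rough illustration of what such a buffer contains, the sketch below packs a list of floats into 4-byte values and back using only the standard library. The float32 element type is an assumption here; confirm the dtype that your RedisVL version and schema use.

```python
from array import array


def to_float32_buffer(vector):
    # Pack floats as 4-byte values in native byte order (assumed float32).
    return array("f", vector).tobytes()


def from_float32_buffer(buffer):
    # Inverse of to_float32_buffer: unpack bytes back into a list of floats.
    values = array("f")
    values.frombytes(buffer)
    return values.tolist()


vector = [0.5, -0.25, 2.0]
buffer = to_float32_buffer(vector)
restored = from_float32_buffer(buffer)
print(len(buffer), restored)
```

The sample values are chosen to be exactly representable in 32 bits; real embeddings lose a little precision in the round trip.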

VertexAI

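VertexAI is GCP's fully-featured AI platform, which includes a number of pretrained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you first need to install the google-cloud-aiplatform library.

pip install "google-cloud-aiplatform>=1.26"

Then you need access to a Google Cloud Project and must provide access to credentials. To do this, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file for a service account, downloaded from GCP.

Finally, you need to find your project ID and the geographic region for VertexAI.

Make sure the following environment variables are set: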

GOOGLE_APPLICATION_CREDENTIALS=<path to your gcp JSON creds>
GCP_PROJECT_ID=<your gcp project id>
GCP_LOCATION=<your gcp geo region for vertex ai>
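
In non-interactive runs, where a getpass prompt is not an option, it can help to check the environment up front. This is a small sketch using only the standard library; missing_env_vars is a hypothetical helper, not part of RedisVL.

```python
import os

REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "GCP_LOCATION",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    # Return the names that are unset or empty in the given environment mapping.
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All VertexAI environment variables are set.")
```
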
from redisvl.utils.vectorize import VertexAITextVectorizer


# create a vectorizer
vtx = VertexAITextVectorizer(api_config={
    "project_id": os.environ.get("GCP_PROJECT_ID") or getpass.getpass("Enter your GCP Project ID: "),
    "location": os.environ.get("GCP_LOCATION") or getpass.getpass("Enter your GCP Location: "),
    "google_application_credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or getpass.getpass("Enter your Google App Credentials path: ")
})

# embed a sentence
test = vtx.embed("This is a test sentence.")
test[:10]

[0.04373306408524513,
 -0.05040992051362991,
 -0.011946038343012333,
 -0.043528858572244644,
 0.021510830149054527,
 0.028604144230484962,
 0.014770914800465107,
 -0.01610461436212063,
 -0.0036560404114425182,
 0.013746795244514942]

Cohere

Cohere allows you to implement language AI in your product. The CohereTextVectorizer makes it simple to use RedisVL with the embedding models at Cohere. To use it, you need to install cohere:

pip install cohere
import getpass
# set up the API Key
api_key = os.environ.get("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")

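Pay special attention to the input_type parameter of each embed call. For example, to embed a query, set input_type='search_query'. To embed documents, set input_type='search_document'. See the Cohere documentation for more information.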

from redisvl.utils.vectorize import CohereTextVectorizer

# create a vectorizer
co = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": api_key},
)

# embed a search query
test = co.embed("This is a test sentence.", input_type='search_query')
print("Vector dimensions: ", len(test))
print(test[:10])

# embed a document
test = co.embed("This is a test sentence.", input_type='search_document')
print("Vector dimensions: ", len(test))
print(test[:10])

Vector dimensions:  1024
[-0.010856628, -0.019683838, -0.0062179565, 0.003545761, -0.047943115, 0.0009365082, -0.005924225, 0.016174316, -0.03289795, 0.049194336]
Vector dimensions:  1024
[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834]

Learn more about using RedisVL and Cohere together in the dedicated user guide.

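Search with provider embeddings

Now that you have created your embeddings, you can use them to search for similar sentences. You will use the same three sentences from above and search for similar sentences.

First, create the schema for the index.

The example schema for the HuggingFace vectorizer looks like this in YAML: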

version: '0.1.0'

index:
    name: vectorizers
    prefix: doc
    storage_type: hash

fields:
    - name: text
      type: text
    - name: embedding
      type: vector
      attrs:
        dims: 768
        algorithm: flat
        distance_metric: cosine
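
The dims value has to match the length of the vectors the chosen model produces: 768 for the HuggingFace model used here, whereas the OpenAI and Cohere examples above printed 1536 and 1024. One way to avoid a mismatch is to derive dims from a sample embedding. The sketch below builds the vector field definition as a plain Python dictionary; vector_field is a hypothetical helper, and the placeholder list stands in for a real hf.embed(...) result.

```python
def vector_field(sample_embedding, name="embedding"):
    # Mirror the vector field of the YAML schema, taking dims from the sample.
    return {
        "name": name,
        "type": "vector",
        "attrs": {
            "dims": len(sample_embedding),
            "algorithm": "flat",
            "distance_metric": "cosine",
        },
    }


# Placeholder with the same length as an all-mpnet-base-v2 embedding.
sample = [0.0] * 768
field = vector_field(sample)
print(field["attrs"]["dims"])
```

Switching to a different vectorizer then only requires embedding one sample sentence with it before building the schema.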
from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")

# connect to local redis instance
index.connect("redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)
# use the CLI to see the created index
!rvl index listall

22:02:27 [RedisVL] INFO   Indices:
22:02:27 [RedisVL] INFO   1. vectorizers
# load expects an iterable of dictionaries where
# the vector is stored as a bytes buffer

data = [{"text": t,
         "embedding": v}
        for t, v in zip(sentences, embeddings)]

index.load(data)

    ['doc:17c401b679ce43cb82f3ab2280ad02f2',
     'doc:3fc0502bec434b17a3f06e20824b2e59',
     'doc:199f17b0e5d24dcaa1fd4fb41558150c']
from redisvl.query import VectorQuery

# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)

results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])

That is a happy dog 0.160862326622
That is a happy person 0.273598492146
Today is a sunny day 0.744559407234
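
The second column is the vector distance under the cosine metric set in the schema: smaller values mean closer matches, which is why the dog sentence ranks first for the cat query. Cosine distance is conventionally defined as 1 minus cosine similarity. The sketch below computes it in plain Python on toy vectors to make the ordering concrete; it illustrates the metric only and is not how Redis computes it internally.

```python
import math


def cosine_distance(a, b):
    # 1 - (a . b) / (|a| * |b|); 0 means same direction, 1 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "45 degrees": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}

# Sort candidates from closest to farthest, as a vector query would.
ranked = sorted(candidates.items(), key=lambda kv: cosine_distance(query, kv[1]))
for label, vec in ranked:
    print(label, round(cosine_distance(query, vec), 4))
```
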
# cleanup
index.delete()