Vectorizers

In this notebook, we show how to use RedisVL's built-in text embedding vectorizers to create embeddings. RedisVL currently supports:

  1. OpenAI
  2. HuggingFace
  3. Vertex AI
  4. Cohere
  5. Mistral AI
  6. Amazon Bedrock
  7. Bringing your own vectorizer
  8. VoyageAI

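Before running this notebook, be sure that:

  1. You have installed redisvl and have that environment active for this notebook.
  2. You have a running Redis Stack instance with a RediSearch version > 2.4.

For example, you can run Redis Stack locally with Docker: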

docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

This runs Redis on port 6379 and RedisInsight at http://localhost:8001.

# import necessary modules
import os

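Creating Text Embeddings

This example shows how to create embeddings from 3 simple sentences using several different text vectorizers in RedisVL.

  • "That is a happy dog"
  • "That is a happy person"
  • "Today is a sunny day"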

OpenAI

The OpenAITextVectorizer makes it simple to use RedisVL with OpenAI's embedding models. To use it, install openai:

pip install openai
import getpass

# setup the API Key
api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
from redisvl.utils.vectorize import OpenAITextVectorizer

# create a vectorizer
oai = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": api_key},
)

test = oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
Vector dimensions:  1536

[-0.0011391325388103724,
 -0.003206387162208557,
 0.002380132209509611,
 -0.004501554183661938,
 -0.010328996926546097,
 0.012922565452754498,
 -0.005491119809448719,
 -0.0029864837415516376,
 -0.007327961269766092,
 -0.03365817293524742]
# Create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]

embeddings = oai.embed_many(sentences)
embeddings[0][:10]
[-0.017466850578784943,
 1.8471690054866485e-05,
 0.00129731057677418,
 -0.02555876597762108,
 -0.019842341542243958,
 0.01603139191865921,
 -0.0037347301840782166,
 0.0009670283179730177,
 0.006618348415941,
 -0.02497442066669464]
# OpenAI also supports asynchronous requests, which we can use to speed up the vectorization process.
embeddings = await oai.aembed_many(sentences)
print("Number of Embeddings:", len(embeddings))
Number of Embeddings: 3
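
The `await` call above relies on the event loop that the notebook already runs. The following is a minimal sketch of the same pattern in a plain Python script. It assumes a stand-in coroutine: `fake_embed` and `embed_all` are hypothetical names that only simulate request latency, and they are not part of RedisVL or the OpenAI client.

```python
import asyncio
from typing import List


async def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a network-bound embedding request."""
    await asyncio.sleep(0.05)  # simulate request latency
    return [float(len(text)), 0.0, 0.0]  # toy 3-dimensional "embedding"


async def embed_all(texts: List[str]) -> List[List[float]]:
    # Start every request at once and wait for all of them.
    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(fake_embed(t) for t in texts))


sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]

# Outside a notebook there is no running event loop, so create one explicitly.
embeddings = asyncio.run(embed_all(sentences))
print("Number of Embeddings:", len(embeddings))
```

Because the simulated requests overlap instead of running one after another, the total wait is close to the latency of a single request. With a real vectorizer the gain depends on the provider and its rate limits.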

Azure OpenAI

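The AzureOpenAITextVectorizer is a variant of the OpenAI vectorizer that calls OpenAI models within Azure. If you have already installed openai, you are ready to use Azure OpenAI.

The only practical difference between OpenAI and Azure OpenAI is the set of variables required to call the API.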

# in addition to the API key, set up the API endpoint and version
api_key = os.environ.get("AZURE_OPENAI_API_KEY") or getpass.getpass("Enter your AzureOpenAI API key: ")
api_version = os.environ.get("OPENAI_API_VERSION") or getpass.getpass("Enter your AzureOpenAI API version: ")
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT") or getpass.getpass("Enter your AzureOpenAI API endpoint: ")
deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME", "text-embedding-ada-002")
from redisvl.utils.vectorize import AzureOpenAITextVectorizer

# create a vectorizer
az_oai = AzureOpenAITextVectorizer(
    model=deployment_name, # Must be your CUSTOM deployment name
    api_config={
        "api_key": api_key,
        "api_version": api_version,
        "azure_endpoint": azure_endpoint
    },
)

test = az_oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
If no endpoint is configured, the AzureOpenAITextVectorizer constructor raises:

ValueError: AzureOpenAI API endpoint is required. Provide it in api_config or set the AZURE_OPENAI_ENDPOINT environment variable.
# Just like OpenAI, AzureOpenAI supports batching embeddings and asynchronous requests.
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]

embeddings = await az_oai.aembed_many(sentences)
embeddings[0][:10]
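
Because the Azure vectorizer needs an endpoint and an API version in addition to the key, it can help to check for missing settings before constructing it. The sketch below uses only the standard library. The variable names come from the cell above, while `missing_settings` is a hypothetical helper and not part of RedisVL.

```python
import os
from typing import List, Mapping, Sequence

# Environment variable names used in the Azure OpenAI cell above.
REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_settings(names: Sequence[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in env (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = missing_settings(REQUIRED_AZURE_VARS, os.environ)
if missing:
    print("Missing Azure OpenAI settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```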

Huggingface

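HuggingFace is a popular NLP platform with many pre-trained models that you can use off the shelf. RedisVL supports using HuggingFace "Sentence Transformers" to create embeddings from text. To use HuggingFace, install the sentence-transformers library.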

pip install sentence-transformers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer


# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]
# You can also create many embeddings at once
embeddings = hf.embed_many(sentences, as_buffer=True)

VertexAI

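VertexAI is GCP's fully featured AI platform, which includes a number of pre-trained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, first install the google-cloud-aiplatform library.

pip install "google-cloud-aiplatform>=1.26"

  1. Next, you need access to a Google Cloud project and must provide access to credentials. You can do this by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file downloaded from your GCP service account.
  2. Finally, you need to find your project ID and the geographic region for VertexAI.

Make sure the following environment variables are set: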

GOOGLE_APPLICATION_CREDENTIALS=<path to your gcp JSON creds>
GCP_PROJECT_ID=<your gcp project id>
GCP_LOCATION=<your gcp geo region for vertex ai>
from redisvl.utils.vectorize import VertexAITextVectorizer


# create a vectorizer
vtx = VertexAITextVectorizer(api_config={
    "project_id": os.environ.get("GCP_PROJECT_ID") or getpass.getpass("Enter your GCP Project ID: "),
    "location": os.environ.get("GCP_LOCATION") or getpass.getpass("Enter your GCP Location: "),
    "google_application_credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or getpass.getpass("Enter your Google App Credentials path: ")
})

# embed a sentence
test = vtx.embed("This is a test sentence.")
test[:10]

Cohere

Cohere allows you to implement language AI into your product. The CohereTextVectorizer makes it simple to use RedisVL with Cohere's embedding models. To use it, install cohere:

pip install cohere
import getpass
# setup the API Key
api_key = os.environ.get("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")

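Pay special attention to the input_type parameter for each embed call. For example, set input_type='search_query' when embedding queries and input_type='search_document' when embedding documents. See the Cohere documentation for more information.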

from redisvl.utils.vectorize import CohereTextVectorizer

# create a vectorizer
co = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": api_key},
)

# embed a search query
test = co.embed("This is a test sentence.", input_type='search_query')
print("Vector dimensions: ", len(test))
print(test[:10])

# embed a document
test = co.embed("This is a test sentence.", input_type='search_document')
print("Vector dimensions: ", len(test))
print(test[:10])

Learn more about using RedisVL and Cohere together in the dedicated user guide.

VoyageAI

VoyageAI allows you to implement language AI into your product. The VoyageAITextVectorizer makes it simple to use RedisVL with VoyageAI's embedding models. To use it, install voyageai:

pip install voyageai
import getpass
# setup the API Key
api_key = os.environ.get("VOYAGE_API_KEY") or getpass.getpass("Enter your VoyageAI API key: ")

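Pay special attention to the input_type parameter for each embed call. For example, set input_type='query' when embedding queries and input_type='document' when embedding documents. See the VoyageAI documentation for more information.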

from redisvl.utils.vectorize import VoyageAITextVectorizer

# create a vectorizer
vo = VoyageAITextVectorizer(
    model="voyage-law-2",  # Please check the available models at https://docs.voyageai.com/docs/embeddings
    api_config={"api_key": api_key},
)

# embed a search query
test = vo.embed("This is a test sentence.", input_type='query')
print("Vector dimensions: ", len(test))
print(test[:10])

# embed a document
test = vo.embed("This is a test sentence.", input_type='document')
print("Vector dimensions: ", len(test))
print(test[:10])
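
Cohere and VoyageAI use different input_type values for the same two roles, so application code that switches providers may want a single lookup. The following is a minimal sketch under that assumption. The values are the ones given in the two sections above, and `input_type_for` is a hypothetical helper, not a RedisVL function.

```python
from typing import Dict

# input_type values as given in the Cohere and VoyageAI sections above.
INPUT_TYPES: Dict[str, Dict[str, str]] = {
    "cohere": {"query": "search_query", "document": "search_document"},
    "voyageai": {"query": "query", "document": "document"},
}


def input_type_for(provider: str, role: str) -> str:
    """Map a role ('query' or 'document') to the provider's input_type value."""
    try:
        return INPUT_TYPES[provider][role]
    except KeyError:
        raise ValueError(f"unsupported provider/role: {provider!r}/{role!r}") from None


print(input_type_for("cohere", "query"))
print(input_type_for("voyageai", "document"))
```

With the vectorizers defined above, a call would then look like `co.embed(text, input_type=input_type_for("cohere", "query"))`.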

Mistral AI

Mistral offers LLM and embedding APIs for you to integrate into your product. The MistralAITextVectorizer makes it simple to use RedisVL with their embedding models. To use it, install mistralai:

pip install mistralai
from redisvl.utils.vectorize import MistralAITextVectorizer

mistral = MistralAITextVectorizer()

# embed a sentence using their asynchronous method
test = await mistral.aembed("This is a test sentence.")
print("Vector dimensions: ", len(test))
print(test[:10])

Amazon Bedrock

Amazon Bedrock provides fully managed foundation models for text embeddings. Install the required dependencies:

pip install 'redisvl[bedrock]'  # Installs boto3

Configure AWS credentials:

import os
import getpass

if "AWS_ACCESS_KEY_ID" not in os.environ:
    os.environ["AWS_ACCESS_KEY_ID"] = getpass.getpass("Enter AWS Access Key ID: ")
if "AWS_SECRET_ACCESS_KEY" not in os.environ:
    os.environ["AWS_SECRET_ACCESS_KEY"] = getpass.getpass("Enter AWS Secret Key: ")

os.environ["AWS_REGION"] = "us-east-1"  # Change as needed

Create embeddings:

from redisvl.utils.vectorize import BedrockTextVectorizer

bedrock = BedrockTextVectorizer(
    model="amazon.titan-embed-text-v2:0"
)

# Single embedding
text = "This is a test sentence."
embedding = bedrock.embed(text)
print(f"Vector dimensions: {len(embedding)}")

# Multiple embeddings
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = bedrock.embed_many(sentences)

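Custom Vectorizers

RedisVL supports the use of other vectorizers and provides a class that is compatible with any function that generates one or more vectors from string data.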

from redisvl.utils.vectorize import CustomTextVectorizer

def generate_embeddings(text_input, **kwargs):
    return [0.101] * 768

custom_vectorizer = CustomTextVectorizer(generate_embeddings)

custom_vectorizer.embed("This is a test sentence.")[:10]
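
The constant-vector function above returns the same embedding for every input, which makes every pair of texts look identical. The sketch below is a slightly more useful placeholder: a deterministic function that derives a different 768-dimensional vector for each input from SHA-256 digests. It is an illustration only. The values carry no semantic meaning, and the approach is an assumption of this example rather than something RedisVL provides.

```python
import hashlib
from typing import List

DIMS = 768  # same dimension as the constant-vector example above


def generate_embeddings(text_input: str, **kwargs) -> List[float]:
    """Toy deterministic embedding with values in [0, 1); not semantically meaningful."""
    values: List[float] = []
    counter = 0
    while len(values) < DIMS:
        # Each SHA-256 digest yields 32 bytes, so 24 digests fill 768 dimensions.
        digest = hashlib.sha256(f"{counter}:{text_input}".encode("utf-8")).digest()
        values.extend(byte / 256 for byte in digest)
        counter += 1
    return values[:DIMS]


vector = generate_embeddings("This is a test sentence.")
print(len(vector))
```

Since it keeps the same signature as the function above, it should be usable in the same way, for example `CustomTextVectorizer(generate_embeddings)`.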

This allows a custom vectorizer to be used with other RedisVL components:

from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(name="custom_cache", vectorizer=custom_vectorizer)

cache.store("this is a test prompt", "this is a test response")
cache.check("this is also a test prompt")

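Search with Provider Embeddings

Now that we have created our embeddings, we can use them to search for similar sentences. We will use the same 3 sentences from above.

First, we need to create the schema for our index.

Here's what the schema for the HuggingFace vectorizer example looks like in yaml: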

version: '0.1.0'

index:
    name: vectorizers
    prefix: doc
    storage_type: hash

fields:
    - name: text
      type: text
    - name: embedding
      type: vector
      attrs:
        dims: 768
        algorithm: flat
        distance_metric: cosine
from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml", redis_url="redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)
# use the CLI to see the created index
!rvl index listall

Loading data into RedisVL is easy. It expects a list of dictionaries. The vector is stored as bytes.

from redisvl.redis.utils import array_to_buffer

embeddings = hf.embed_many(sentences)

data = [{"text": t,
         "embedding": array_to_buffer(v, dtype="float32")}
        for t, v in zip(sentences, embeddings)]

index.load(data)
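
To make "stored as bytes" concrete, the following standard-library sketch packs a list of floats into a float32 byte string and unpacks it again. It illustrates the idea only and is not a description of how `array_to_buffer` is implemented. The helper names and the little-endian byte order are assumptions of this example.

```python
import struct
from typing import List, Sequence


def floats_to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian float32 values (hypothetical helper)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def float32_bytes_to_floats(buffer: bytes) -> List[float]:
    """Inverse of floats_to_float32_bytes."""
    count = len(buffer) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"<{count}f", buffer))


vector = [0.5, -1.25, 2.0]  # values that float32 represents exactly
buffer = floats_to_float32_bytes(vector)

print(len(buffer))  # 12
print(float32_bytes_to_floats(buffer) == vector)  # True
```

A 768-dimensional float32 vector packed this way occupies 768 × 4 = 3072 bytes.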
from redisvl.query import VectorQuery

# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)

results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])
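
The schema above uses the cosine distance metric, so a smaller vector_distance means a closer match. As a rough illustration of what such a ranking computes, here is a pure-Python sketch that takes cosine distance to be 1 minus cosine similarity. The two-dimensional vectors are made up for the example and are not real embeddings, and `cosine_distance` and `nearest` are hypothetical helpers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, taken here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    docs: Dict[str, Sequence[float]],
    num_results: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to num_results (text, distance) pairs, closest first."""
    scored = [(text, cosine_distance(query, vec)) for text, vec in docs.items()]
    return sorted(scored, key=lambda item: item[1])[:num_results]


# Made-up 2-dimensional vectors standing in for real embeddings.
toy_docs = {
    "That is a happy dog": [0.9, 0.1],
    "That is a happy person": [0.7, 0.4],
    "Today is a sunny day": [0.1, 0.9],
}
toy_query = [1.0, 0.0]

for text, distance in nearest(toy_query, toy_docs):
    print(text, round(distance, 4))
```

This brute-force scan over every stored vector is conceptually what the flat algorithm in the schema does. The index performs the comparison inside Redis rather than in Python.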

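Selecting your float data type

When embedding text as byte arrays, RedisVL supports 4 different floating point data types (float16, float32, float64, and bfloat16) and 2 integer types (int8 and uint8). The dtype set for your vectorizer must match what is defined in your search index. If one is not explicitly set, the default is float32.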

vectorizer = HFTextVectorizer(dtype="float16")

# subsequent calls to embed('', as_buffer=True) and embed_many('', as_buffer=True) will now encode as float16
float16_bytes = vectorizer.embed('test sentence', as_buffer=True)

# to generate embeddings with different dtype instantiate a new vectorizer
vectorizer_64 = HFTextVectorizer(dtype='float64')
float64_bytes = vectorizer_64.embed('test sentence', as_buffer=True)

float16_bytes != float64_bytes
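
One way to see why the vectorizer dtype must match the index definition is to compare how many bytes the same vector occupies at each width. The sketch below uses the standard-library `struct` module with a hypothetical `pack_vector` helper. It covers only the three float widths that `struct` can pack. bfloat16 has no `struct` format code and is left out.

```python
import struct
from typing import Dict, Sequence

# struct format codes for half, single, and double precision floats.
FORMATS: Dict[str, str] = {"float16": "e", "float32": "f", "float64": "d"}


def pack_vector(vector: Sequence[float], dtype: str) -> bytes:
    """Pack a vector as little-endian values of the given dtype (hypothetical helper)."""
    return struct.pack(f"<{len(vector)}{FORMATS[dtype]}", *vector)


vector = [0.5] * 768
sizes = {dtype: len(pack_vector(vector, dtype)) for dtype in FORMATS}
print(sizes)  # {'float16': 1536, 'float32': 3072, 'float64': 6144}
```

A buffer written as float16 is half the length that a float32 field of the same dimension expects, so the bytes cannot be interpreted correctly if the two settings differ.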
# cleanup
index.delete()