Vectorizers
In this notebook, we show how to create embeddings with the text embedding vectorizers built into RedisVL. RedisVL currently supports:
- OpenAI
- HuggingFace
- Vertex AI
- Cohere
- Mistral AI
- Amazon Bedrock
- Bring your own vectorizer
- VoyageAI
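Before running this notebook, make sure that:
- You have installed redisvl and activated that environment for this notebook.
- You have a running Redis Stack instance with RediSearch version > 2.4.
For example, you can run Redis Stack locally with Docker: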
docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
This runs Redis on port 6379 and RedisInsight at http://localhost:8001.
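Before moving on, it can help to confirm that something is listening on the Redis port. The following is a minimal sketch using only the Python standard library; the helper name redis_is_reachable is hypothetical, and the check only verifies that a TCP connection can be opened, not that RediSearch is installed or which version is running.

```python
import socket


def redis_is_reachable(host: str = "localhost", port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print("Redis reachable:", redis_is_reachable())
```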
# import necessary modules
import os
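Creating Text Embeddings
This example shows how to create embeddings from 3 simple sentences using several of the text vectorizers in RedisVL.
- "That is a happy dog"
- "That is a happy person"
- "Today is a sunny day"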
OpenAI
OpenAITextVectorizer makes it simple to use RedisVL with OpenAI's embedding models. To use it, you need to install openai.
pip install openai
import getpass
# setup the API Key
api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
from redisvl.utils.vectorize import OpenAITextVectorizer
# create a vectorizer
oai = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": api_key},
)
test = oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
Vector dimensions: 1536
[-0.0011391325388103724,
-0.003206387162208557,
0.002380132209509611,
-0.004501554183661938,
-0.010328996926546097,
0.012922565452754498,
-0.005491119809448719,
-0.0029864837415516376,
-0.007327961269766092,
-0.03365817293524742]
# Create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = oai.embed_many(sentences)
embeddings[0][:10]
[-0.017466850578784943,
1.8471690054866485e-05,
0.00129731057677418,
-0.02555876597762108,
-0.019842341542243958,
0.01603139191865921,
-0.0037347301840782166,
0.0009670283179730177,
0.006618348415941,
-0.02497442066669464]
# OpenAI also supports asynchronous requests, which we can use to speed up the vectorization process.
embeddings = await oai.aembed_many(sentences)
print("Number of Embeddings:", len(embeddings))
Number of Embeddings: 3
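To illustrate why asynchronous requests can speed things up, here is a minimal sketch that runs several simulated embedding requests concurrently with asyncio.gather. The coroutine fake_embed_batch is a hypothetical stand-in that sleeps instead of calling any API, and the returned vectors are placeholders; nothing here reflects how RedisVL or OpenAI batch requests internally. In a notebook you would use await directly instead of asyncio.run.

```python
import asyncio
from typing import List


async def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a network-bound embedding request."""
    await asyncio.sleep(0.05)  # simulate request latency
    # Placeholder "embedding": a single number derived from the text length.
    return [[float(len(text))] for text in batch]


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def embed_concurrently(texts: List[str], batch_size: int = 2) -> List[List[float]]:
    batches = chunk(texts, batch_size)
    # All batches wait on their (simulated) network call at the same time.
    results = await asyncio.gather(*(fake_embed_batch(b) for b in batches))
    # gather preserves input order, so flattening keeps the original text order.
    return [vector for batch in results for vector in batch]


sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]
embeddings = asyncio.run(embed_concurrently(sentences))
print("Number of embeddings:", len(embeddings))
```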
Azure OpenAI
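AzureOpenAITextVectorizer is a variation of the OpenAI vectorizer that calls OpenAI models within Azure. If you have already installed openai, you are ready to use Azure OpenAI.
The only practical difference between OpenAI and Azure OpenAI is the set of variables required to call the API.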
# in addition to the API key, set up the API endpoint and version
api_key = os.environ.get("AZURE_OPENAI_API_KEY") or getpass.getpass("Enter your AzureOpenAI API key: ")
api_version = os.environ.get("OPENAI_API_VERSION") or getpass.getpass("Enter your AzureOpenAI API version: ")
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT") or getpass.getpass("Enter your AzureOpenAI API endpoint: ")
deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME", "text-embedding-ada-002")
from redisvl.utils.vectorize import AzureOpenAITextVectorizer
# create a vectorizer
az_oai = AzureOpenAITextVectorizer(
    model=deployment_name, # Must be your CUSTOM deployment name
    api_config={
        "api_key": api_key,
        "api_version": api_version,
        "azure_endpoint": azure_endpoint
    },
)
test = az_oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
If no endpoint is provided, the AzureOpenAITextVectorizer constructor raises the following error:
ValueError: AzureOpenAI API endpoint is required. Provide it in api_config or set the AZURE_OPENAI_ENDPOINT environment variable.
# Just like OpenAI, AzureOpenAI supports batching embeddings and asynchronous requests.
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = await az_oai.aembed_many(sentences)
embeddings[0][:10]
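The ValueError shown earlier in this section is raised when the endpoint is missing. One way to catch this sooner is a small pre-flight check that reports every missing setting at once. The sketch below assumes the environment variable names used in the cell above; the helper build_azure_api_config is hypothetical and not part of RedisVL, and the example values are placeholders.

```python
from typing import Dict, Mapping

# api_config keys mapped to the environment variable names used in this section.
REQUIRED_AZURE_VARS = {
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "OPENAI_API_VERSION",
    "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
}


def build_azure_api_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper: collect Azure settings or fail listing every missing one."""
    missing = [var for var in REQUIRED_AZURE_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing Azure OpenAI settings: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_AZURE_VARS.items()}


# A plain dict stands in for os.environ here; all values are placeholders.
example_env = {
    "AZURE_OPENAI_API_KEY": "placeholder-key",
    "OPENAI_API_VERSION": "placeholder-version",
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
}
api_config = build_azure_api_config(example_env)
print(sorted(api_config))
```

In practice you would pass os.environ instead of example_env and hand the resulting dictionary to the vectorizer as api_config.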
Huggingface
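Huggingface is a popular NLP platform with many pre-trained models that you can use off the shelf. RedisVL supports using Huggingface "Sentence Transformers" to create embeddings from text. To use Huggingface, you need to install the sentence-transformers library.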
pip install sentence-transformers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer
# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")
# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]
# You can also create many embeddings at once
embeddings = hf.embed_many(sentences, as_buffer=True)
VertexAI
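VertexAI is GCP's fully featured AI platform, which includes a number of pre-trained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you first need to install the google-cloud-aiplatform library.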
pip install "google-cloud-aiplatform>=1.26"
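- Then you need access to a Google Cloud project and must provide credentials. This is done by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file downloaded from your GCP service account.
- Finally, you need to find your project ID and the geographic region for VertexAI.
Make sure the following environment variables are set: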
GOOGLE_APPLICATION_CREDENTIALS=<path to your gcp JSON creds>
GCP_PROJECT_ID=<your gcp project id>
GCP_LOCATION=<your gcp geo region for vertex ai>
from redisvl.utils.vectorize import VertexAITextVectorizer
# create a vectorizer
vtx = VertexAITextVectorizer(api_config={
    "project_id": os.environ.get("GCP_PROJECT_ID") or getpass.getpass("Enter your GCP Project ID: "),
    "location": os.environ.get("GCP_LOCATION") or getpass.getpass("Enter your GCP Location: "),
    "google_application_credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or getpass.getpass("Enter your Google App Credentials path: ")
})
# embed a sentence
test = vtx.embed("This is a test sentence.")
test[:10]
Cohere
Cohere allows you to integrate language AI into your product. CohereTextVectorizer makes it simple to use RedisVL with Cohere's embedding models. To use it, you need to install cohere.
pip install cohere
import getpass
# setup the API Key
api_key = os.environ.get("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")
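For each embed call, pay particular attention to the input_type parameter. For example, set input_type='search_query' when embedding a query and input_type='search_document' when embedding documents. See the Cohere documentation for more information.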
from redisvl.utils.vectorize import CohereTextVectorizer
# create a vectorizer
co = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": api_key},
)
# embed a search query
test = co.embed("This is a test sentence.", input_type='search_query')
print("Vector dimensions: ", len(test))
print(test[:10])
# embed a document
test = co.embed("This is a test sentence.", input_type='search_document')
print("Vector dimensions: ", len(test))
print(test[:10])
Learn more about using RedisVL and Cohere together in the dedicated user guide.
VoyageAI
VoyageAI allows you to integrate language AI into your product. VoyageAITextVectorizer makes it simple to use RedisVL with VoyageAI's embedding models. To use it, you need to install voyageai.
pip install voyageai
import getpass
# setup the API Key
api_key = os.environ.get("VOYAGE_API_KEY") or getpass.getpass("Enter your VoyageAI API key: ")
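For each embed call, pay particular attention to the input_type parameter. For example, set input_type='query' when embedding a query and input_type='document' when embedding documents. See the VoyageAI documentation for more information.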
from redisvl.utils.vectorize import VoyageAITextVectorizer
# create a vectorizer
vo = VoyageAITextVectorizer(
    model="voyage-law-2", # Please check the available models at https://docs.voyageai.com/docs/embeddings
    api_config={"api_key": api_key},
)
# embed a search query
test = vo.embed("This is a test sentence.", input_type='query')
print("Vector dimensions: ", len(test))
print(test[:10])
# embed a document
test = vo.embed("This is a test sentence.", input_type='document')
print("Vector dimensions: ", len(test))
print(test[:10])
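Cohere and VoyageAI use different input_type values for the same two roles, which is easy to mix up when switching providers. A minimal sketch of a lookup that keeps the choice in one place is shown below. The values are copied from the two sections above; the helper input_type_for and the provider keys "cohere" and "voyageai" are hypothetical names chosen for illustration.

```python
# input_type values as used in the Cohere and VoyageAI examples above.
INPUT_TYPES = {
    "cohere": {"query": "search_query", "document": "search_document"},
    "voyageai": {"query": "query", "document": "document"},
}


def input_type_for(provider: str, role: str) -> str:
    """Hypothetical helper: map a role ("query" or "document") to a provider's input_type."""
    try:
        return INPUT_TYPES[provider][role]
    except KeyError:
        raise ValueError(f"Unsupported provider/role: {provider!r}/{role!r}") from None


print(input_type_for("cohere", "query"))
print(input_type_for("voyageai", "document"))
```

The returned string could then be passed as the input_type argument of the corresponding embed call.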
Mistral AI
Mistral offers LLM and embedding APIs that you can integrate into your product. MistralAITextVectorizer makes it simple to use RedisVL with their embedding models. You need to install mistralai.
pip install mistralai
from redisvl.utils.vectorize import MistralAITextVectorizer
mistral = MistralAITextVectorizer()
# embed a sentence using their asynchronous method
test = await mistral.aembed("This is a test sentence.")
print("Vector dimensions: ", len(test))
print(test[:10])
Amazon Bedrock
Amazon Bedrock provides fully managed foundation models for text embeddings. Install the required dependencies:
pip install 'redisvl[bedrock]' # Installs boto3
Configure AWS credentials:
import os
import getpass
if "AWS_ACCESS_KEY_ID" not in os.environ:
    os.environ["AWS_ACCESS_KEY_ID"] = getpass.getpass("Enter AWS Access Key ID: ")
if "AWS_SECRET_ACCESS_KEY" not in os.environ:
    os.environ["AWS_SECRET_ACCESS_KEY"] = getpass.getpass("Enter AWS Secret Key: ")
os.environ["AWS_REGION"] = "us-east-1" # Change as needed
Create embeddings:
from redisvl.utils.vectorize import BedrockTextVectorizer
bedrock = BedrockTextVectorizer(
    model="amazon.titan-embed-text-v2:0"
)
# Single embedding
text = "This is a test sentence."
embedding = bedrock.embed(text)
print(f"Vector dimensions: {len(embedding)}")
# Multiple embeddings
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = bedrock.embed_many(sentences)
Custom Vectorizers
RedisVL supports other vectorizers and provides a class that works with any function that generates one or more vectors from string data:
from redisvl.utils.vectorize import CustomTextVectorizer
def generate_embeddings(text_input, **kwargs):
    return [0.101] * 768
custom_vectorizer = CustomTextVectorizer(generate_embeddings)
custom_vectorizer.embed("This is a test sentence.")[:10]
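The embedding function above returns the same constant vector for every input, which is enough to show the wiring but makes every text look identical. As a minimal sketch, the function below produces a deterministic vector that differs per input, using only the standard library. It is a toy with no semantic meaning, hash_embedding is a hypothetical name, and it simply keeps the (text_input, **kwargs) shape of the example above so it could be passed to CustomTextVectorizer in the same way.

```python
import hashlib
from typing import List


def hash_embedding(text_input: str, dims: int = 768, **kwargs) -> List[float]:
    """Toy deterministic embedding. NOT semantically meaningful; for wiring tests only."""
    values: List[float] = []
    counter = 0
    while len(values) < dims:
        # Each SHA-256 digest yields 32 bytes; vary the counter to get more.
        digest = hashlib.sha256(f"{counter}:{text_input}".encode("utf-8")).digest()
        # Map each byte (0..255) into the range [-1.0, 1.0].
        values.extend(b / 127.5 - 1.0 for b in digest)
        counter += 1
    return values[:dims]


vec = hash_embedding("This is a test sentence.")
print(len(vec), vec[:3])
```

Because the vectors come from a hash, similar sentences do not end up close together, so this is only useful for testing pipelines, not for real search.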
This makes the custom vectorizer usable with other RedisVL components:
from redisvl.extensions.llmcache import SemanticCache
cache = SemanticCache(name="custom_cache", vectorizer=custom_vectorizer)
cache.store("this is a test prompt", "this is a test response")
cache.check("this is also a test prompt")
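Search with Provider Embeddings
Now that we have created embeddings, we can use them to search for similar sentences. We will use the same 3 sentences from above.
First, we need to create the schema for our index.
Here is what the YAML schema looks like for the HuggingFace vectorizer example: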
version: '0.1.0'

index:
    name: vectorizers
    prefix: doc
    storage_type: hash

fields:
    - name: sentence
      type: text
    - name: embedding
      type: vector
      attrs:
        dims: 768
        algorithm: flat
        distance_metric: cosine
from redisvl.index import SearchIndex
# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml", redis_url="redis://localhost:6379")
# create the index (no data yet)
index.create(overwrite=True)
# use the CLI to see the created index
!rvl index listall
Loading data into RedisVL is easy. It expects a list of dictionaries. The vectors are stored as bytes.
from redisvl.redis.utils import array_to_buffer
embeddings = hf.embed_many(sentences)
data = [{"text": t,
         "embedding": array_to_buffer(v, dtype="float32")}
        for t, v in zip(sentences, embeddings)]
index.load(data)
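To make "stored as bytes" concrete, here is a minimal sketch of packing a list of floats into a float32 byte buffer and reading it back, using only the standard library. This is not RedisVL's implementation of array_to_buffer; the function names are hypothetical, and the sketch assumes the common case where the platform's C float is 4 bytes, stored in native byte order.

```python
from array import array
from typing import List


def floats_to_float32_bytes(vector: List[float]) -> bytes:
    """Pack floats as 4-byte float32 values in native byte order."""
    return array("f", vector).tobytes()


def float32_bytes_to_floats(buffer: bytes) -> List[float]:
    """Unpack a float32 byte buffer back into a list of Python floats."""
    values = array("f")
    values.frombytes(buffer)
    return values.tolist()


buf = floats_to_float32_bytes([0.5, -1.0, 2.0])
print(len(buf))
print(float32_bytes_to_floats(buf))
```

Each float32 element takes 4 bytes, so a 768-dimensional vector like the one in the schema above occupies 3072 bytes.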
from redisvl.query import VectorQuery
# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")
query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)
results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])
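The schema above uses distance_metric: cosine, so the vector_distance values printed by the query are cosine distances, where smaller means more similar. As a minimal sketch of what that number represents, the function below computes 1 minus the cosine similarity in plain Python. The two-dimensional vectors and their labels are made up for illustration and are not real embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors standing in for sentence embeddings.
docs = {"dog": [0.9, 0.1], "person": [0.7, 0.3], "day": [0.0, 1.0]}
query_vector = [1.0, 0.0]

# Rank documents from most to least similar, as a flat vector search would.
ranked = sorted(docs, key=lambda name: cosine_distance(query_vector, docs[name]))
print(ranked)
```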
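Selecting your float data type
When embedding text as byte arrays, RedisVL supports 4 floating-point data types (float16, float32, float64, and bfloat16) and 2 integer types (int8 and uint8). The dtype set on your vectorizer must match the one defined in your search index. If it is not set explicitly, it defaults to float32.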
vectorizer = HFTextVectorizer(dtype="float16")
# subsequent calls to embed('', as_buffer=True) and embed_many('', as_buffer=True) will now encode as float16
float16_bytes = vectorizer.embed('test sentence', as_buffer=True)
# to generate embeddings with different dtype instantiate a new vectorizer
vectorizer_64 = HFTextVectorizer(dtype='float64')
float64_bytes = vectorizer_64.embed('test sentence', as_buffer=True)
float16_bytes != float64_bytes
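One practical consequence of the dtype choice is the size of each stored vector, and a size mismatch is a quick signal that the vectorizer and the index disagree. The sketch below lists the bytes per element for the dtypes named above and checks a buffer's length against an expected dims and dtype. The helpers expected_buffer_size and buffer_matches are hypothetical and not part of RedisVL.

```python
# Bytes per vector element for the dtypes listed above.
BYTES_PER_ELEMENT = {
    "float16": 2,
    "bfloat16": 2,
    "float32": 4,
    "float64": 8,
    "int8": 1,
    "uint8": 1,
}


def expected_buffer_size(dims: int, dtype: str) -> int:
    """Return the byte length of a dims-element vector stored as dtype."""
    try:
        return dims * BYTES_PER_ELEMENT[dtype]
    except KeyError:
        raise ValueError(f"Unsupported dtype: {dtype!r}") from None


def buffer_matches(buffer: bytes, dims: int, dtype: str) -> bool:
    """Hypothetical check: does the buffer length fit the index definition?"""
    return len(buffer) == expected_buffer_size(dims, dtype)


# Zero-filled placeholder with the size of a 768-dimensional float32 vector.
float32_buffer = bytes(expected_buffer_size(768, "float32"))
print(len(float32_buffer))
print(buffer_matches(float32_buffer, 768, "float32"))
print(buffer_matches(float32_buffer, 768, "float16"))
```

A length check cannot tell float16 from bfloat16, since both use 2 bytes per element, so it complements rather than replaces keeping the dtype consistent in both places.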
# cleanup
index.delete()