Hash vs JSON Storage

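Out of the box, Redis provides a variety of data structures that can adapt to your domain-specific applications and use cases. In this notebook, we will demonstrate how to use RedisVL with both Hash and JSON data.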

Before running this notebook, make sure that:

  1. redisvl is installed and its environment is active in this notebook.
  2. You have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.

For example, you can run Redis Stack locally with Docker:

docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Or create a free Redis Cloud database.

# import necessary modules
import pickle

from redisvl.redis.utils import buffer_to_array
from redisvl.index import SearchIndex


# load the example data and the printing utilities
from jupyterutils import result_print, table_print

# use a context manager so the file handle is closed after loading
with open("hybrid_example_data.pkl", "rb") as f:
    data = pickle.load(f)

table_print(data)
user   | age | job           | credit_score | office_location   | user_embedding
john   | 18  | engineer      | high         | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'
derrick| 14  | doctor        |              | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'
nancy  | 94  | doctor        |              | -122.4194,37.7749 | b'333?\xcd\xcc\xcc=\x00\x00\x00?'
tyler  | 100 | engineer      | high         | -122.0839,37.3861 | b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?'
tim    | 12  | dermatologist |              | -122.0839,37.3861 | b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?'
taimur | 15  | CEO           |              | -122.0839,37.3861 | b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?'
joe    | 35  | dentist       |              | -122.0839,37.3861 | b'fff?fff?\xcd\xcc\xcc='

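Hash or JSON: how to choose?

Both storage options offer their own features and tradeoffs. Below we work through a sample dataset to show when and how to use each.

Working with Hashes

Hashes in Redis are simple collections of field-value pairs. Think of a hash as a mutable, single-level dictionary containing multiple "rows":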

{
    "model": "Deimos",
    "brand": "Ergonom",
    "type": "Enduro bikes",
    "price": 4972,
}

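Hashes are best suited for use cases with the following characteristics:

  • Performance (speed) and storage space (memory consumption) are top concerns
  • Data can be easily normalized and modeled as a single-level dictionary

Hashes are typically the default recommendation.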
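Normalizing data into a single-level dictionary can be sketched as follows. This is a minimal illustration, not part of RedisVL: the `flatten` helper, the underscore separator, and the nested `details` key are all assumptions made for the example.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining keys with `sep`.

    Hypothetical helper for illustration; not a RedisVL function.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dicts and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# a nested variant of the bike record shown above (the nesting is made up)
nested_bike = {
    "model": "Deimos",
    "details": {"brand": "Ergonom", "type": "Enduro bikes", "price": 4972},
}

flat_bike = flatten(nested_bike)
print(flat_bike)
```

Every value in `flat_bike` is now a scalar, which is the shape a hash expects. If flattening a record like this feels forced, that is a hint that JSON storage may fit better.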

# define the hash index schema
hash_schema = {
    "index": {
        "name": "user-hash",
        "prefix": "user-hash-docs",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
# construct a search index from the hash schema
hindex = SearchIndex.from_dict(hash_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
hindex.create(overwrite=True)
# show the underlying storage type
hindex.storage_type
<StorageType.HASH: 'hash'>

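Vectors as byte strings

One nuance when working with Hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example can be seen below: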

# show a single entry from the data that will be loaded
data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'}
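To see where a byte string like this comes from, the sketch below packs a list of floats using only the standard library. It assumes little-endian float32 encoding, which is consistent with the `float32` datatype in the schema and with the bytes shown above; `floats_to_bytes` is a hypothetical name, not a RedisVL function.

```python
import struct


def floats_to_bytes(values):
    """Pack floats as little-endian float32 values, 4 bytes each."""
    return struct.pack(f"<{len(values)}f", *values)


embedding = floats_to_bytes([0.1, 0.1, 0.5])
print(embedding)
print(len(embedding))  # 3 dims * 4 bytes per float32
```

The result is the same 12-byte value stored for `john`, so his embedding is the vector [0.1, 0.1, 0.5].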
# load hash data
keys = hindex.load(data)
!rvl stats -i user-hash
Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key                    │ Value       │
├─────────────────────────────┼─────────────┤
│ num_docs                    │ 7           │
│ num_terms                   │ 6           │
│ max_doc_id                  │ 7           │
│ num_records                 │ 44          │
│ percent_indexed             │ 1           │
│ hash_indexing_failures      │ 0           │
│ number_of_uses              │ 1           │
│ bytes_per_record_avg        │ 3.40909     │
│ doc_table_size_mb           │ 0.000767708 │
│ inverted_sz_mb              │ 0.000143051 │
│ key_table_size_mb           │ 0.000248909 │
│ offset_bits_per_record_avg  │ 8           │
│ offset_vectors_sz_mb        │ 8.58307e-06 │
│ offsets_per_term_avg        │ 0.204545    │
│ records_per_doc_avg         │ 6.28571     │
│ sortable_values_size_mb     │ 0           │
│ total_indexing_time         │ 1.053       │
│ total_inverted_index_blocks │ 18          │
│ vector_index_sz_mb          │ 0.0202332   │
╰─────────────────────────────┴─────────────╯

Performing Queries

With the index created and the data loaded in the right format, we can run queries against the index with RedisVL:

from redisvl.query import VectorQuery
from redisvl.query.filter import Tag, Text, Num

t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17)

v = VectorQuery(
    vector=[0.1, 0.1, 0.5],
    vector_field_name="user_embedding",
    return_fields=["user", "credit_score", "age", "job", "office_location"],
    filter_expression=t
)


results = hindex.query(v)
result_print(results)
vector_distance | user  | credit_score | age | job      | office_location
0               | john  | high         | 18  | engineer | -122.4194,37.7749
0.109129190445  | tyler | high         | 100 | engineer | -122.0839,37.3861
# clean up
hindex.delete()
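The `vector_distance` values returned by the query can be checked by hand. The sketch below computes cosine distance (1 minus cosine similarity) in plain Python for the two matching users, decoding their embeddings from the byte strings in the sample data. It assumes little-endian float32 encoding; Redis computes the distance internally, so the last few digits may differ.

```python
import math
import struct


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [0.1, 0.1, 0.5]

# embeddings copied from the sample data table
john = list(struct.unpack("<3f", b"\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?"))
tyler = list(struct.unpack("<3f", b"\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?"))

john_distance = cosine_distance(query, john)
tyler_distance = cosine_distance(query, tyler)
print(round(john_distance, 6), round(tyler_distance, 6))
```

`john` has the same embedding as the query vector, so his distance is effectively zero, while `tyler` comes out at roughly 0.109129, in line with the table above.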

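Working with JSON

JSON is best suited for use cases with the following characteristics:

  • Ease of use and data model flexibility are top concerns
  • Application data is already native JSON
  • Replacing another document storage/database solution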
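As a rough illustration of the criteria above, the sketch below suggests a storage type from the shape of a record alone. It is a deliberately naive heuristic with a hypothetical function name: it only checks for nested dictionaries and ignores the performance, memory, and ease-of-use considerations listed earlier.

```python
def suggest_storage_type(record):
    """Suggest "json" for records with nested dicts, otherwise "hash".

    Hypothetical heuristic for illustration only; it looks at shape, nothing else.
    """
    has_nested = any(isinstance(value, dict) for value in record.values())
    return "json" if has_nested else "hash"


flat_record = {"model": "Deimos", "brand": "Ergonom", "price": 4972}
nested_record = {
    "name": "Specialized Stump jumper",
    "metadata": {"model": "Stumpjumper", "brand": "Specialized", "price": 3000},
}

flat_choice = suggest_storage_type(flat_record)
nested_choice = suggest_storage_type(nested_record)
print(flat_choice, nested_choice)
```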
# define the json index schema
json_schema = {
    "index": {
        "name": "user-json",
        "prefix": "user-json-docs",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
# construct a search index from the json schema
jindex = SearchIndex.from_dict(json_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
jindex.create(overwrite=True)
# list the indices in the database (user-hash was deleted above, so only user-json is listed)
!rvl index listall
11:54:18 [RedisVL] INFO   Indices:
11:54:18 [RedisVL] INFO   1. user-json

Vectors as float arrays

Vectorized data stored in JSON must be stored as a pure array (Python list) of floats. Below we modify the sample data to account for this:

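# copy each record so the byte-string embeddings in `data` are left untouched
# (a plain data.copy() is shallow and would share the same dicts)
json_data = [dict(d) for d in data]

for d in json_data:
    d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype='float32')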
# inspect a single JSON record
json_data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}
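The conversion from bytes to floats can be reproduced with the standard library. The sketch below assumes little-endian float32 encoding; `bytes_to_floats` is a hypothetical stand-in used for illustration and is not a description of how `buffer_to_array` is implemented.

```python
import struct


def bytes_to_floats(buffer):
    """Unpack a little-endian float32 buffer into a list of Python floats."""
    count = len(buffer) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", buffer))


vector = bytes_to_floats(b"\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?")
print(vector)
```

The value 0.1 has no exact float32 representation, which is why it appears as 0.10000000149011612 once widened to a Python float, while 0.5 survives unchanged.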
keys = jindex.load(json_data)
# we can now run the exact same query as above
result_print(jindex.query(v))
vector_distance用户信用评分年龄职业办公地点
0john18工程师-122.4194,37.7749
0.109129190445tyler100工程师-122.0839,37.3861

Cleanup

jindex.delete()

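Working with nested data in JSON

Redis also supports native JSON objects. These can be multi-level (nested) objects, with full JSONPath support for updating and retrieving sub-elements: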

{
    "name": "Specialized Stump jumper",
    "metadata": {
        "model": "Stumpjumper",
        "brand": "Specialized",
        "type": "Enduro bikes",
        "price": 3000
    },
}

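Full JSON Path support

Because Redis supports full JSON paths, each element in an index schema needs both the desired name and a path that points to where the data is located within the object.

By default, if no path is provided in a JSON field schema, RedisVL assumes the path is $.{name}. For nested data, provide the path as $.object.attribute.

For example: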

from redisvl.utils.vectorize import HFTextVectorizer

emb_model = HFTextVectorizer()

bike_data = [
    {
        "name": "Specialized Stump jumper",
        "metadata": {
            "model": "Stumpjumper",
            "brand": "Specialized",
            "type": "Enduro bikes",
            "price": 3000
        },
        "description": "The Specialized Stumpjumper is a versatile enduro bike that dominates both climbs and descents. Features a FACT 11m carbon fiber frame, FOX FLOAT suspension with 160mm travel, and SRAM X01 Eagle drivetrain. The asymmetric frame design and internal storage compartment make it a practical choice for all-day adventures."
    },
    {
        "name": "bike_2",
        "metadata": {
            "model": "Slash",
            "brand": "Trek",
            "type": "Enduro bikes",
            "price": 5000
        },
        "description": "Trek's Slash is built for aggressive enduro riding and racing. Featuring Trek's Alpha Aluminum frame with RE:aktiv suspension technology, 160mm travel, and Knock Block frame protection. Equipped with Bontrager components and a Shimano XT drivetrain, this bike excels on technical trails and enduro race courses."
    }
]

bike_data = [{**d, "bike_embedding": emb_model.embed(d["description"])} for d in bike_data]

bike_schema = {
    "index": {
        "name": "bike-json",
        "prefix": "bike-json",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {
            "name": "model",
            "type": "tag",
            "path": "$.metadata.model" # note the '$'
        },
        {
            "name": "brand",
            "type": "tag",
            "path": "$.metadata.brand"
        },
        {
            "name": "price",
            "type": "numeric",
            "path": "$.metadata.price"
        },
        {
            "name": "bike_embedding",
            "type": "vector",
            "attrs": {
                "dims": len(bike_data[0]["bike_embedding"]),
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
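The relationship between a field's name and its path can be sketched in plain Python. The two helpers below are hypothetical and only handle simple dotted paths such as $.metadata.model; they are not a JSONPath engine and do not reflect how Redis or RedisVL resolve paths internally. The `name` field without a path is also an assumption added to show the $.{name} default.

```python
def default_path(field):
    """Use the explicit path if given, otherwise fall back to $.{name}."""
    return field.get("path", f"$.{field['name']}")


def resolve(document, path):
    """Follow a simple dotted path like $.metadata.model through nested dicts."""
    if not path.startswith("$."):
        raise ValueError(f"expected a path starting with '$.': {path}")
    value = document
    for key in path[2:].split("."):
        value = value[key]
    return value


bike = {
    "name": "Specialized Stump jumper",
    "metadata": {"model": "Stumpjumper", "brand": "Specialized", "price": 3000},
}

fields = [
    {"name": "model", "type": "tag", "path": "$.metadata.model"},
    {"name": "price", "type": "numeric", "path": "$.metadata.price"},
    {"name": "name", "type": "text"},  # hypothetical field with no explicit path
]

indexed = {f["name"]: resolve(bike, default_path(f)) for f in fields}
print(indexed)
```

The nested values are reached through their explicit paths, while `name` resolves through the default `$.name`.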
# construct a search index from the json schema
bike_index = SearchIndex.from_dict(bike_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
bike_index.create(overwrite=True)
bike_index.load(bike_data)
['bike-json:de92cb9955434575b20f4e87a30b03d5',
 'bike-json:054ab3718b984532b924946fa5ce00c6']
from redisvl.query import VectorQuery

vec = emb_model.embed("I'd like a bike for aggressive riding")

v = VectorQuery(
    vector=vec,
    vector_field_name="bike_embedding",
    return_fields=[
        "brand",
        "name",
        "$.metadata.type"
    ]
)


results = bike_index.query(v)

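Note: As shown in the example, if you want to retrieve a field from a JSON object that was not indexed, you also need to supply its full path, as with $.metadata.type.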

results
[{'id': 'bike-json:054ab3718b984532b924946fa5ce00c6',
  'vector_distance': '0.519989073277',
  'brand': 'Trek',
  '$.metadata.type': 'Enduro bikes'},
 {'id': 'bike-json:de92cb9955434575b20f4e87a30b03d5',
  'vector_distance': '0.657624483109',
  'brand': 'Specialized',
  '$.metadata.type': 'Enduro bikes'}]
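In the output above, `vector_distance` comes back as a string, so it is worth converting it before comparing values numerically. The sketch below post-processes a copy of the results shown; the variable names are illustrative.

```python
# results copied from the query output above
results = [
    {
        "id": "bike-json:054ab3718b984532b924946fa5ce00c6",
        "vector_distance": "0.519989073277",
        "brand": "Trek",
        "$.metadata.type": "Enduro bikes",
    },
    {
        "id": "bike-json:de92cb9955434575b20f4e87a30b03d5",
        "vector_distance": "0.657624483109",
        "brand": "Specialized",
        "$.metadata.type": "Enduro bikes",
    },
]

# sort by numeric distance; sorting the raw strings would compare text, not numbers
ranked = sorted(results, key=lambda r: float(r["vector_distance"]))
best = ranked[0]
summary = [(r["brand"], round(float(r["vector_distance"]), 3)) for r in ranked]
print(best["brand"], summary)
```

A smaller cosine distance means a closer match, so the Trek record ranks first for the "aggressive riding" query.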

Cleanup

bike_index.delete()