In this article, we explore the challenges and opportunities of deploying a large BERT question-answering Transformer model from Hugging Face, using RedisGears and RedisAI to do the heavy lifting while leveraging Redis, the in-memory datastore.
In data science workloads, it is usually acceptable for a single model inference to take seconds.
In customer-facing applications, however, answers need to come back in a fraction of a second.
So, before we go any further: why should you read this article? Here are a few numbers to motivate you.
First, run:
python3 transformers_plain_bert_qa.py
airborne transmission of respiratory infections is the lack of established methods for the detection of airborne respiratory microorganisms
10.351818372 seconds
The script above uses a slightly modified Transformer from the default BERT QA pipeline, and running it on the server takes 10 seconds. The server runs the latest 12th Gen Intel(R) Core(TM) i9-12900K (full cpuinfo).
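For reference, here is a minimal sketch of the kind of plain-pipeline script being timed above; the checkpoint name and question are my assumptions for illustration, and the actual transformers_plain_bert_qa.py in the repo is slightly modified from this:

from transformers import pipeline

# Plain Hugging Face BERT QA pipeline: tokenization and model inference
# run end-to-end on every call, which is what makes this the slow baseline.
qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

context = "..."  # one paragraph from the dataset (placeholder)
result = qa(question="Who performs viral transmission among adults?", context=context)
print(result["answer"])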
However:
time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' https://localhost:8080/qasearch
real 0m0.747s
user 0m0.004s
sys 0m0.000s
This script runs BERT QA inference on every shard (by default, the number of shards equals the number of available CPUs) and returns the answer in under a second.
Incredible, right? Let's dig in!
In a BERT QA pipeline (or any other modern NLP inference task), there are two steps: tokenizing the input, and then running model inference on the tokens.
With Redis, we can pre-compute everything and store it in memory, but how do we do that? Unlike a summarization ML task, the question isn't known in advance, so we can't pre-compute all possible answers. We can, however, use RedisGears to pre-tokenize all the potential answers, i.e., all the paragraphs in the dataset:
# Runs inside RedisGears; loadTokeniser and remove_prefix are defined elsewhere
# in the gear script, while hashtag() and execute() are RedisGears built-ins.
def parse_sentence(record):
    import redisAI
    import numpy as np
    global tokenizer
    if not tokenizer:
        tokenizer = loadTokeniser()
    hash_tag = "{%s}" % hashtag()

    for idx, value in sorted(record['value'].items(), key=lambda item: int(item[0])):
        # Tokenize the paragraph, leaving room for the separator token
        tokens = tokenizer.encode(value, add_special_tokens=False, max_length=511, truncation=True, return_tensors="np")
        tokens = np.append(tokens, tokenizer.sep_token_id).astype(np.int64)
        # Store the tokens as a RedisAI tensor on this shard
        tensor = redisAI.createTensorFromBlob('INT64', tokens.shape, tokens.tobytes())
        key_prefix = 'sentence:'
        sentence_key = remove_prefix(record['key'], key_prefix)
        token_key = f"tokenized:bert:qa:{sentence_key}:{idx}"
        redisAI.setTensorInKey(token_key, tensor)
        # Track processed keys in a shard-local set; the hash tag keeps it on this shard
        execute('SADD', f'processed_docs_stage3_tokenized{hash_tag}', token_key)
Check out the full code on GitHub.
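For context, parse_sentence runs inside the Redis process as a RedisGears function. Here is a sketch of how it might be wired into a batch job over all stored paragraphs (the exact registration in the repo may differ):

gb = GB('KeysReader')       # iterate keys locally on each shard
gb.foreach(parse_sentence)  # tokenize each paragraph and store the tensor
gb.count()                  # reduce to a count so the caller gets a summary
gb.run('sentence:*')        # process every sentence:* key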
Then, for each Redis Cluster shard, we pre-load the BERT QA model by downloading it, exporting it to TorchScript, and then loading it into each shard:
# ClusterClient here is a Redis cluster client that also exposes the RedisAI
# helpers modelset/infoget (see the full code on GitHub).
def load_bert():
    model_file = 'traced_bert_qa.pt'

    # Read the TorchScript model exported earlier
    with open(model_file, 'rb') as f:
        model = f.read()

    startup_nodes = [{"host": "127.0.0.1", "port": "30001"}, {"host": "127.0.0.1", "port": "30002"}, {"host": "127.0.0.1", "port": "30003"}]
    cc = ClusterClient(startup_nodes=startup_nodes)
    # Ask each shard for its hash tag via a RedisGears ShardsIDReader job
    hash_tags = cc.execute_command("RG.PYEXECUTE", "gb = GB('ShardsIDReader').map(lambda x:hashtag()).run()")[0]
    print(hash_tags)
    for hash_tag in hash_tags:
        # Store the model under a key carrying the shard's hash tag,
        # so the model lands on that shard
        print("Loading model bert-qa{%s}" % hash_tag.decode('utf-8'))
        cc.modelset('bert-qa{%s}' % hash_tag.decode('utf-8'), 'TORCH', 'CPU', model)
        print(cc.infoget('bert-qa{%s}' % hash_tag.decode('utf-8')))
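The export step itself is not shown above. Here is a minimal sketch of producing traced_bert_qa.pt, assuming the standard Hugging Face SQuAD checkpoint; the repo's export_load script is the authoritative version, and its checkpoint and shapes may differ:

import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad",
    torchscript=True)
model.eval()

# Trace with dummy input_ids, attention_mask, and token_type_ids tensors,
# matching the three inputs the inference code feeds to RedisAI.
dummy = torch.ones(1, 384, dtype=torch.int64)
traced = torch.jit.trace(model, (dummy, dummy, dummy))
torch.jit.save(traced, 'traced_bert_qa.pt')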
When a user asks a question, we tokenize the question and append it to a pre-tokenized potential answer before running the RedisAI model:
token_key = f"tokenized:bert:qa:{sentence_key}"
# encode question
input_ids_question = tokenizer.encode(question, add_special_tokens=True, truncation=True, return_tensors="np")
t=redisAI.getTensorFromKey(token_key)
input_ids_context=to_np(t,np.int64)
# merge (append) with potential answer, context - is pre-tokenized paragraph
input_ids = np.append(input_ids_question,input_ids_context)
attention_mask = np.array([[1]*len(input_ids)])
input_idss=np.array([input_ids])
num_seg_a=input_ids_question.shape[1]
num_seg_b=input_ids_context.shape[0]
token_type_ids = np.array([0]*num_seg_a + [1]*num_seg_b)
# create actual model runner for RedisAI
modelRunner = redisAI.createModelRunner(f'bert-qa{hash_tag}')
# make sure all types are correct
input_idss_ts=redisAI.createTensorFromBlob('INT64', input_idss.shape, input_idss.tobytes())
attention_mask_ts=redisAI.createTensorFromBlob('INT64', attention_mask.shape, attention_mask.tobytes())
token_type_ids_ts=redisAI.createTensorFromBlob('INT64', token_type_ids.shape, token_type_ids.tobytes())
redisAI.modelRunnerAddInput(modelRunner, 'input_ids', input_idss_ts)
redisAI.modelRunnerAddInput(modelRunner, 'attention_mask', attention_mask_ts)
redisAI.modelRunnerAddInput(modelRunner, 'token_type_ids', token_type_ids_ts)
redisAI.modelRunnerAddOutput(modelRunner, 'answer_start_scores')
redisAI.modelRunnerAddOutput(modelRunner, 'answer_end_scores')
# run RedisAI model runner
res = await redisAI.modelRunnerRunAsync(modelRunner)
answer_start_scores=to_np(res[0],np.float32)
answer_end_scores = to_np(res[1],np.float32)
answer_start = np.argmax(answer_start_scores)
answer_end = np.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end],skip_special_tokens = True))
log("Answer "+str(answer))
return answer
Check out the full code on GitHub.
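The to_np helper used above converts a RedisAI tensor back into a NumPy array. A minimal sketch of such a helper (the repo defines its own version):

def to_np(tensor, dtype):
    # Reinterpret the tensor's raw bytes and restore its original shape
    return np.frombuffer(redisAI.tensorGetDataAsBlob(tensor), dtype=dtype).reshape(redisAI.tensorGetDims(tensor))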
The process used to execute a BERT QA API call looks like this:
Here I am using two cool features of RedisGears: capturing key-miss events, and using async/await to run RedisAI on each shard without locking the main thread, so the Redis Cluster can continue serving other customers. For benchmarking purposes, caching responses from RedisAI is disabled. If you get response times in nanoseconds rather than milliseconds on the second call, make sure the line of code linked above is commented out.
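To make the key-miss capture concrete, here is a sketch of the idea; the handler name and details are illustrative, not the repo's exact mechanics. A GET on a missing bertqa* key triggers a registered gear on the owning shard, which runs the QA function asynchronously:

async def qa_keymiss(record):
    # record['key'] carries the cache key: bertqa{shard}_<sentence key>_<question>
    answer = await run_qa(record['key'])   # hypothetical wrapper around the model runner
    execute('SET', record['key'], answer)  # caching the answer; disabled for the benchmark
    return answer

gb = GB('KeysReader')
gb.map(qa_keymiss)
gb.register(prefix='bertqa*', eventTypes=['keymiss'], mode='async_local')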
Prerequisites for running the benchmark:
Assuming you are running Debian or Ubuntu and have Docker and docker-compose installed (or can create a virtual environment via conda), run the following commands:
git clone --recurse-submodules https://github.com/applied-knowledge-systems/the-pattern.git
cd the-pattern
./bootstrap_benchmark.sh
The commands above should finish with a curl call to the qasearch API, since Redis caching is disabled for the benchmark.
Next, invoke curl like this:
time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' https://localhost:8080/qasearch
Expect the following output, or something similar based on your runtime environment:
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Sun, 29 May 2022 12:05:39 GMT
Content-Type: application/json
Content-Length: 2120
Connection: keep-alive
{"links":[{"created_at":"2002","rank":13,"source":"C0001486","target":"C0152083"}],"results":[{"answer":"adenovirus","sentence":"The medium of 40 T150 flasks of adenovirus transducer dec CAR CHO cells yielded 0 5 1 my of purified msCEACAM1a 1 4 protein","sentencekey":"sentence:PMC125375.xml:{mG}:202","title":"Crystal structure of murine sCEACAM1a[1,4]: a coronavirus receptor in the CEA family"}] OUTPUT_REDUCTED}
I modified the API output so the benchmark returns results from all shards, even if the answer is empty. In the run above, five shards returned answers. Even with all the extra hops to search RedisGraph, the overall API call responds in under a second!
Let's dig deeper into what's happening under the hood:
You should have a sentence key with a shard ID, which you can get by looking for "Cache key" in the output of docker logs -f rgcluster. In my setup, the cache key was "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults". If that looks like it could be a function call, it's because it is a function call. It is triggered if the key doesn't exist in the Redis Cluster, which for the benchmark will be every time, since, as you will remember, we disabled caching the output.
One more thing to figure out from the logs is the port of the shard corresponding to the hash tag, also known as the shard ID. It is the text between the curly brackets and looks like {6fd}. The same hash tags appear in the output of the export_load script. In my case, the cache key turned up in "30012.log", so my port is 30012.
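As an aside, Redis Cluster hashes only the text between the curly brackets, so every key carrying the same hash tag lands on the same shard; that is what keeps the model, the tokenized tensors, and the cache key co-located. You can verify this with CLUSTER KEYSLOT (using the port from your own logs):

redis-cli -c -p 30012 -h 127.0.0.1 CLUSTER KEYSLOT "{6fd}"
redis-cli -c -p 30012 -h 127.0.0.1 CLUSTER KEYSLOT "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"

Both commands should return the same slot number.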
Next, I run the following command:
redis-cli -c -p 30012 -h 127.0.0.1 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
Then I run the benchmark:
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 0.04 seconds
50 parallel clients
3 bytes payload
keep alive: 1
10.00% <= 41 milliseconds
100.00% <= 41 milliseconds
238.10 requests per second
In case you are wondering, -n is the number of requests; in this case we run the benchmark 10 times. You can also add --csv if you want the output in CSV format, or --precision 3 if you want more decimal places in the millisecond latencies.
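For example, a run with both flags (same key as above; the exact output shape depends on your redis-benchmark version):

redis-benchmark -p 30012 -h 127.0.0.1 -n 10 --csv --precision 3 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"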
For more information about the benchmarking tool, visit the redis.io benchmarks page.
If you don't have redis-utils installed locally, you can use Docker as follows:
docker exec -it rgcluster /bin/bash
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 1.75 seconds
50 parallel clients
99 bytes payload
keep alive: 1
host configuration "save":
host configuration "appendonly": no
multi-thread: no
Latency by percentile distribution:
0.000% <= 243.711 milliseconds (cumulative count 1)
50.000% <= 987.135 milliseconds (cumulative count 5)
75.000% <= 1577.983 milliseconds (cumulative count 8)
87.500% <= 1662.975 milliseconds (cumulative count 9)
93.750% <= 1744.895 milliseconds (cumulative count 10)
100.000% <= 1744.895 milliseconds (cumulative count 10)
Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
10.000% <= 244.223 milliseconds (cumulative count 1)
20.000% <= 409.343 milliseconds (cumulative count 2)
30.000% <= 575.487 milliseconds (cumulative count 3)
40.000% <= 821.247 milliseconds (cumulative count 4)
50.000% <= 987.135 milliseconds (cumulative count 5)
60.000% <= 1157.119 milliseconds (cumulative count 6)
70.000% <= 1497.087 milliseconds (cumulative count 7)
80.000% <= 1577.983 milliseconds (cumulative count 8)
90.000% <= 1662.975 milliseconds (cumulative count 9)
100.000% <= 1744.895 milliseconds (cumulative count 10)
Summary:
throughput summary: 5.73 requests per second
latency summary (msec):
avg min p50 p95 p99 max
1067.296 243.584 987.135 1744.895 1744.895 1744.895
The platform has only 20 articles and 8 Redis nodes (4 masters + 4 replicas), so the relevance of the answers will be off, but it also doesn't need much memory.
Now let's check how long our RedisAI model takes to run on the {6fd} shard:
127.0.0.1:30012> AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 8928136
13) "samples"
14) (integer) 58
15) "calls"
16) (integer) 58
17) "errors"
18) (integer) 0
bert-qa{6fd} is the key under which the actual (very large) model is saved. The AI.INFO command reports a cumulative duration of 8928136 microseconds over 58 calls, which works out to roughly 154 milliseconds per call.
Let's double-check by resetting the stats and then re-running the benchmark.
First, reset the statistics:
127.0.0.1:30012> AI.INFO bert-qa{6fd} RESETSTAT
OK
127.0.0.1:30012> AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 0
13) "samples"
14) (integer) 0
15) "calls"
16) (integer) 0
17) "errors"
18) (integer) 0
Then, re-run the benchmark:
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 1.78 seconds
50 parallel clients
99 bytes payload
keep alive: 1
host configuration "save":
host configuration "appendonly": no
multi-thread: no
Latency by percentile distribution:
0.000% <= 188.927 milliseconds (cumulative count 1)
50.000% <= 995.839 milliseconds (cumulative count 5)
75.000% <= 1606.655 milliseconds (cumulative count 8)
87.500% <= 1692.671 milliseconds (cumulative count 9)
93.750% <= 1779.711 milliseconds (cumulative count 10)
100.000% <= 1779.711 milliseconds (cumulative count 10)
Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
10.000% <= 189.183 milliseconds (cumulative count 1)
20.000% <= 392.191 milliseconds (cumulative count 2)
30.000% <= 540.159 milliseconds (cumulative count 3)
40.000% <= 896.511 milliseconds (cumulative count 4)
50.000% <= 996.351 milliseconds (cumulative count 5)
60.000% <= 1260.543 milliseconds (cumulative count 6)
70.000% <= 1456.127 milliseconds (cumulative count 7)
80.000% <= 1606.655 milliseconds (cumulative count 8)
90.000% <= 1692.671 milliseconds (cumulative count 9)
100.000% <= 1779.711 milliseconds (cumulative count 10)
Summary:
throughput summary: 5.62 requests per second
latency summary (msec):
avg min p50 p95 p99 max
1080.454 188.800 995.839 1779.711 1779.711 1779.711
Now check the stats again:
AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 1767749
13) "samples"
14) (integer) 20
15) "calls"
16) (integer) 20
17) "errors"
18) (integer) 0
Now we get 88387.45 microseconds per call (1767749 microseconds over 20 calls), which is blazingly fast! And considering that we started at 10 seconds per call, I think the benefit of RedisAI combined with RedisGears is pretty obvious. The trade-off, however, is high memory usage.
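If you want to reproduce that arithmetic without a calculator, here is a small sketch that pulls the stats with redis-py and computes the per-call latency (the shard port and hash tag are the ones from my setup):

import redis

r = redis.Redis(host="127.0.0.1", port=30012)
info = r.execute_command("AI.INFO", "bert-qa{6fd}")
# AI.INFO replies with a flat list of alternating field names and values
stats = {info[i].decode(): info[i + 1] for i in range(0, len(info), 2)}
per_call_ms = int(stats["duration"]) / int(stats["calls"]) / 1000
print(f"{per_call_ms:.1f} ms per call")  # 1767749 / 20 = 88387.45 us, about 88.4 ms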
There are many ways to optimize this deployment. For example, you could add FP16 quantization and the ONNX runtime. If you want to try it, this script would be a good starting point.
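As a sketch of the ONNX direction (the checkpoint, opset, and shapes here are illustrative assumptions; the linked script is the better starting point):

import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad",
    torchscript=True)  # tuple outputs, which the exporter handles cleanly
model.eval()

dummy = torch.ones(1, 384, dtype=torch.int64)
torch.onnx.export(
    model, (dummy, dummy, dummy), "bert-qa.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["answer_start_scores", "answer_end_scores"],
    dynamic_axes={n: {1: "seq"} for n in
                  ["input_ids", "attention_mask", "token_type_ids"]},
    opset_version=14)

The resulting file could then be loaded into RedisAI with the ONNX backend instead of TORCH.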
Thanks to Mikhail Volkov, we can now observe RedisGears and RedisGraph throughput and memory consumption using Grafana. When you clone the repository, it starts a Grafana Docker container with pre-built templates for monitoring the Redis Cluster, including RedisGears and RedisAI, and the Graph (Redis with RedisGraph). The "The Pattern" dashboard provides an overview, with all the key benchmark metrics you care about: