In this article, we'll explore the challenges and opportunities of deploying a large BERT question-answering Transformer model from Hugging Face, using RedisGears and RedisAI to do most of the heavy lifting while also leveraging the in-memory datastore Redis.
What works in data science workloads, however, does not always work in client-facing applications.
Before we dive in, **why should you read this article?** Here are a few numbers to whet your appetite.
First:
python3 transformers_plain_bert_qa.py
airborne transmission of respiratory infections is the lack of established methods for the detection of airborne respiratory microorganisms
10.351818372 seconds
The script above uses a slightly modified transformer from the default BERT QA pipeline and takes 10 seconds to run on the server. The server runs the latest 12th Gen Intel(R) Core(TM) i9-12900K (see the full cpuinfo output for details).
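For reference, here's roughly what a plain Hugging Face QA pipeline looks like. This is a minimal sketch, and the checkpoint name is an assumption; the repo's transformers_plain_bert_qa.py is a slightly modified version of this default pipeline:

# Minimal sketch of a default BERT QA pipeline.
# The checkpoint name is an assumption; the repo's script differs slightly.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa(
    question="What is one reason for the lack of strong evidence?",
    context=(
        "One of the reasons for the lack of strong evidence for airborne "
        "transmission of respiratory infections is the lack of established "
        "methods for the detection of airborne respiratory microorganisms."
    ),
)
print(result["answer"])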
However:
time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' http://localhost:8080/qasearch
real 0m0.747s
user 0m0.004s
sys 0m0.000s
This script runs BERT QA inference on **every shard** (by default, equal to the number of available CPUs) and returns the answer in under a second.
Impressive, right? Let's dive in!
In a BERT QA pipeline (or any other modern NLP inference task), there are two steps: tokenizing the input and running inference on the model.
With Redis, we have the opportunity to pre-compute everything and store it in memory, but how? Unlike with a summarization ML task, the questions aren't known in advance, so we can't pre-compute all possible answers. We can, however, pre-tokenize all potential answers (that is, all paragraphs in the dataset) using RedisGears:
def parse_sentence(record):
    import redisAI
    import numpy as np
    # `tokenizer`, `loadTokeniser`, `remove_prefix`, `hashtag`, and `execute`
    # come from the surrounding RedisGears script environment.
    global tokenizer
    if not tokenizer:
        tokenizer = loadTokeniser()
    hash_tag = "{%s}" % hashtag()

    # Iterate over the paragraphs of the hash in numeric field order.
    for idx, value in sorted(record['value'].items(), key=lambda item: int(item[0])):
        tokens = tokenizer.encode(value, add_special_tokens=False, max_length=511, truncation=True, return_tensors="np")
        tokens = np.append(tokens, tokenizer.sep_token_id).astype(np.int64)
        tensor = redisAI.createTensorFromBlob('INT64', tokens.shape, tokens.tobytes())
        key_prefix = 'sentence:'
        sentence_key = remove_prefix(record['key'], key_prefix)
        token_key = f"tokenized:bert:qa:{sentence_key}:{idx}"
        # Store the pre-tokenized paragraph as a RedisAI tensor and track it in a shard-local set.
        redisAI.setTensorInKey(token_key, tensor)
        execute('SADD', f'processed_docs_stage3_tokenized{hash_tag}', token_key)
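For context, here's a minimal sketch of how parse_sentence might be wired up as a RedisGears batch job over all sentence: keys; the exact registration in the repo may differ:

# Sketch: run parse_sentence as a batch job over existing sentence: keys.
gb = GB('KeysReader')        # reads each matching key and its value
gb.foreach(parse_sentence)   # pre-tokenize every paragraph in the hash
gb.count()                   # reduce to a count so the run returns a small result
gb.run('sentence:*')         # execute over all keys matching the prefix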
Then, for each Redis Cluster shard, we pre-load the BERT QA model by downloading it, exporting it to TorchScript, and loading it into each shard:
def load_bert():
    model_file = 'traced_bert_qa.pt'

    with open(model_file, 'rb') as f:
        model = f.read()

    # Connect to the Redis Cluster; ClusterClient is the RedisAI cluster
    # client set up elsewhere in the script.
    startup_nodes = [{"host": "127.0.0.1", "port": "30001"}, {"host": "127.0.0.1", "port": "30002"}, {"host": "127.0.0.1", "port": "30003"}]
    cc = ClusterClient(startup_nodes=startup_nodes)

    # Collect each shard's hash tag by running a RedisGears job on every shard.
    hash_tags = cc.execute_command("RG.PYEXECUTE", "gb = GB('ShardsIDReader').map(lambda x:hashtag()).run()")[0]
    print(hash_tags)

    # Load the TorchScript model into every shard under a shard-local key.
    for hash_tag in hash_tags:
        print("Loading model bert-qa{%s}" % hash_tag.decode('utf-8'))
        cc.modelset('bert-qa{%s}' % hash_tag.decode('utf-8'), 'TORCH', 'CPU', model)
        print(cc.infoget('bert-qa{%s}' % hash_tag.decode('utf-8')))
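The traced_bert_qa.pt file referenced above is produced by downloading the model and exporting it to TorchScript. A minimal sketch of that step, assuming the bert-large-uncased-whole-word-masking-finetuned-squad checkpoint (the repo's export script may differ):

# Sketch: export the Hugging Face BERT QA model to TorchScript.
# The checkpoint name is an assumption.
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name, torchscript=True)
model.eval()

# Trace with dummy inputs matching the runtime signature:
# (input_ids, attention_mask, token_type_ids)
enc = tokenizer("Who transmits viruses?", "Adults transmit viruses.", return_tensors="pt")
traced = torch.jit.trace(model, (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]))
torch.jit.save(traced, "traced_bert_qa.pt")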
When a user asks a question, we tokenize the question and combine it with the pre-tokenized potential answer before running the RedisAI model:
token_key = f"tokenized:bert:qa:{sentence_key}"
# encode the question; `to_np` (defined elsewhere in the script) converts a RedisAI tensor to a numpy array
input_ids_question = tokenizer.encode(question, add_special_tokens=True, truncation=True, return_tensors="np")
t = redisAI.getTensorFromKey(token_key)
input_ids_context = to_np(t, np.int64)
# merge (append) with the potential answer; the context is the pre-tokenized paragraph
input_ids = np.append(input_ids_question, input_ids_context)
attention_mask = np.array([[1] * len(input_ids)])
input_idss = np.array([input_ids])
num_seg_a = input_ids_question.shape[1]
num_seg_b = input_ids_context.shape[0]
# segment ids: 0 for question tokens, 1 for context tokens
token_type_ids = np.array([0] * num_seg_a + [1] * num_seg_b)
# create the actual model runner for RedisAI
modelRunner = redisAI.createModelRunner(f'bert-qa{hash_tag}')
# make sure all types are correct
input_idss_ts = redisAI.createTensorFromBlob('INT64', input_idss.shape, input_idss.tobytes())
attention_mask_ts = redisAI.createTensorFromBlob('INT64', attention_mask.shape, attention_mask.tobytes())
token_type_ids_ts = redisAI.createTensorFromBlob('INT64', token_type_ids.shape, token_type_ids.tobytes())
redisAI.modelRunnerAddInput(modelRunner, 'input_ids', input_idss_ts)
redisAI.modelRunnerAddInput(modelRunner, 'attention_mask', attention_mask_ts)
redisAI.modelRunnerAddInput(modelRunner, 'token_type_ids', token_type_ids_ts)
redisAI.modelRunnerAddOutput(modelRunner, 'answer_start_scores')
redisAI.modelRunnerAddOutput(modelRunner, 'answer_end_scores')
# run the RedisAI model runner asynchronously so the shard's main thread is not blocked
res = await redisAI.modelRunnerRunAsync(modelRunner)
answer_start_scores = to_np(res[0], np.float32)
answer_end_scores = to_np(res[1], np.float32)
# pick the most likely start/end positions and decode the answer span
answer_start = np.argmax(answer_start_scores)
answer_end = np.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end], skip_special_tokens=True))
log("Answer " + str(answer))
return answer
Here's what the process of making a BERT QA API call looks like:
Here I use two of RedisGears' cool features: capturing key-miss events, and using async/await to run RedisAI on each shard without blocking the main thread, so that the Redis Cluster can continue serving other clients. For the benchmarks, RedisAI response caching is disabled. If you get response times in nanoseconds rather than milliseconds on a second call, check that the line linked above is commented out.
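A key-miss registration in RedisGears looks roughly like this; it's a sketch, and the prefix and the run_qa worker name are assumptions based on the flow described above:

# Sketch: fire an async QA job whenever a bertqa* key is requested but missing.
# run_qa (assumed name) would wrap the inference snippet shown earlier.
gb = GB('KeysReader')
gb.map(run_qa)
gb.register(prefix='bertqa*', eventTypes=['keymiss'], mode='async_local')

The 'async_local' mode keeps execution on the shard that owns the key while leaving Redis free to serve other commands.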
Prerequisites for running the benchmark
Assuming you're running Debian or Ubuntu and have Docker and docker-compose installed (or can create a virtual environment via conda), run the following commands:
git clone --recurse-submodules https://github.com/applied-knowledge-systems/the-pattern.git
cd the-pattern
./bootstrap_benchmark.sh
The commands above should finish with a curl call to the qasearch API, since Redis caching is disabled for the benchmark.
Next, invoke curl like this:
time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' http://localhost:8080/qasearch
Expect to see output like the following, give or take your runtime environment:
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Sun, 29 May 2022 12:05:39 GMT
Content-Type: application/json
Content-Length: 2120
Connection: keep-alive
{"links":[{"created_at":"2002","rank":13,"source":"C0001486","target":"C0152083"}],"results":[{"answer":"adenovirus","sentence":"The medium of 40 T150 flasks of adenovirus transducer dec CAR CHO cells yielded 0 5 1 my of purified msCEACAM1a 1 4 protein","sentencekey":"sentence:PMC125375.xml:{mG}:202","title":"Crystal structure of murine sCEACAM1a[1,4]: a coronavirus receptor in the CEA family"}] OUTPUT_REDUCTED}
I modified the output of the benchmark API so that it returns results from all shards, even when the answer is empty. In the run above, five shards returned answers. The whole API call returned in under a second, including all the extra RedisGraph search hops!
Let's dig deeper into what's happening under the hood:
You should have a sentence key with a shard ID, which you can get by looking for the "Cache key" entries in `docker logs -f rgcluster`. In my setup, the cache key is `bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults`. If it looks like a function call, that's because it is one: it fires whenever the key doesn't exist in the Redis Cluster, which for the benchmark is every time, since, as you'll recall, we disabled output caching. The other thing to work out from the logs is the port of the shard that corresponds to the hash tag, also known as the shard ID. It's the text between the curly braces, which looks like `{6fd}` above. The same information appears in the output of the `export_load` script. In my case the cache key turned up in "30012.log", so my port is 30012.
Next, I run the following command:

redis-cli -c -p 30012 -h 127.0.0.1 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"

Then I run the benchmark:

redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 0.04 seconds
50 parallel clients
3 bytes payload
keep alive: 1
10.00% <= 41 milliseconds
100.00% <= 41 milliseconds
238.10 requests per second
In case you're wondering, `-n` is the number of requests; in this case we run the benchmark 10 times. You can also add:

`--csv` if you want to output in CSV format
`--precision 3` if you want more decimal places in the milliseconds
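For example, combining both flags (a usage sketch):

redis-benchmark -p 30012 -h 127.0.0.1 -n 10 --csv --precision 3 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"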
More information about the benchmarking tool is available on the redis.io benchmarks page.
The platform only contains 20 articles and 8 Redis nodes (4 masters + 4 replicas), so relevance will be off, but it doesn't require a lot of memory.

If you don't have redis-utils installed locally, you can use Docker as follows:
docker exec -it rgcluster /bin/bash
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 1.75 seconds
50 parallel clients
99 bytes payload
keep alive: 1
host configuration "save":
host configuration "appendonly": no
multi-thread: no
Latency by percentile distribution:
0.000% <= 243.711 milliseconds (cumulative count 1)
50.000% <= 987.135 milliseconds (cumulative count 5)
75.000% <= 1577.983 milliseconds (cumulative count 8)
87.500% <= 1662.975 milliseconds (cumulative count 9)
93.750% <= 1744.895 milliseconds (cumulative count 10)
100.000% <= 1744.895 milliseconds (cumulative count 10)
Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
10.000% <= 244.223 milliseconds (cumulative count 1)
20.000% <= 409.343 milliseconds (cumulative count 2)
30.000% <= 575.487 milliseconds (cumulative count 3)
40.000% <= 821.247 milliseconds (cumulative count 4)
50.000% <= 987.135 milliseconds (cumulative count 5)
60.000% <= 1157.119 milliseconds (cumulative count 6)
70.000% <= 1497.087 milliseconds (cumulative count 7)
80.000% <= 1577.983 milliseconds (cumulative count 8)
90.000% <= 1662.975 milliseconds (cumulative count 9)
100.000% <= 1744.895 milliseconds (cumulative count 10)
Summary:
throughput summary: 5.73 requests per second
latency summary (msec):
avg min p50 p95 p99 max
1067.296 243.584 987.135 1744.895 1744.895 1744.895
AI.INFO

`bert-qa{6fd}` is the key of the actual (very large) model that was saved. The AI.INFO command reports a cumulative duration of 8928136 microseconds over 58 calls, which works out to roughly 154 milliseconds per call.
127.0.0.1:30012> AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 8928136
13) "samples"
14) (integer) 58
15) "calls"
16) (integer) 58
17) "errors"
18) (integer) 0
Let's double-check that this is right by resetting the statistics and then re-running the benchmark.

First, reset the statistics:
127.0.0.1:30012> AI.INFO bert-qa{6fd} RESETSTAT
OK
127.0.0.1:30012> AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 0
13) "samples"
14) (integer) 0
15) "calls"
16) (integer) 0
17) "errors"
18) (integer) 0
Then, re-run the benchmark:
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
10 requests completed in 1.78 seconds
50 parallel clients
99 bytes payload
keep alive: 1
host configuration "save":
host configuration "appendonly": no
multi-thread: no
Latency by percentile distribution:
0.000% <= 188.927 milliseconds (cumulative count 1)
50.000% <= 995.839 milliseconds (cumulative count 5)
75.000% <= 1606.655 milliseconds (cumulative count 8)
87.500% <= 1692.671 milliseconds (cumulative count 9)
93.750% <= 1779.711 milliseconds (cumulative count 10)
100.000% <= 1779.711 milliseconds (cumulative count 10)
Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
10.000% <= 189.183 milliseconds (cumulative count 1)
20.000% <= 392.191 milliseconds (cumulative count 2)
30.000% <= 540.159 milliseconds (cumulative count 3)
40.000% <= 896.511 milliseconds (cumulative count 4)
50.000% <= 996.351 milliseconds (cumulative count 5)
60.000% <= 1260.543 milliseconds (cumulative count 6)
70.000% <= 1456.127 milliseconds (cumulative count 7)
80.000% <= 1606.655 milliseconds (cumulative count 8)
90.000% <= 1692.671 milliseconds (cumulative count 9)
100.000% <= 1779.711 milliseconds (cumulative count 10)
Summary:
throughput summary: 5.62 requests per second
latency summary (msec):
avg min p50 p95 p99 max
1080.454 188.800 995.839 1779.711 1779.711 1779.711
As the AI.INFO output below shows, each call now takes only 88387.45 microseconds on average, which is blazingly fast! And considering that it took 10 seconds per call when we started, I think the benefits of using RedisAI in combination with RedisGears are clear. The trade-off, however, is higher memory usage.
127.0.0.1:30012> AI.INFO bert-qa{6fd}
1) "key"
2) "bert-qa{6fd}"
3) "type"
4) "MODEL"
5) "backend"
6) "TORCH"
7) "device"
8) "CPU"
9) "tag"
10) ""
11) "duration"
12) (integer) 1767749
13) "samples"
14) (integer) 20
15) "calls"
16) (integer) 20
17) "errors"
18) (integer) 0