Large-Scale Model Deployment in Practice at ByteDance

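1. Background

At ByteDance, deep-learning applications are everywhere. Engineers care about model quality, but they also have to care about the consistency and performance of the online service. In the early days this usually required algorithm experts and engineering experts to split the work and cooperate closely, a mode that carried a fairly high cost for locating and verifying diffs (mismatches between offline and online results).

With the popularity of frameworks such as PyTorch and TensorFlow, model training and online inference have been unified. Developers only need to focus on the algorithm logic and call the framework's Python API to train and validate. The model can then be easily serialized and exported, and inference is handled by a unified high-performance C++ engine. This improved the developer experience from training to deployment.

However, a complete service usually also contains a large amount of business logic such as preprocessing and postprocessing. This logic typically transforms various inputs into Tensors, feeds them to the model, and then converts the model's output Tensors into the target format. Some typical scenarios are as follows: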

Bert

Resnet
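The pattern described above (raw input, preprocessing into model input, model, postprocessing into a response) can be sketched in plain Python. Everything in this sketch is an assumption made for illustration: the vocabulary, `fake_model`, and the output format are stand-ins, and plain lists take the place of Tensors.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real service would load one from disk.
VOCAB: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "good": 2, "movie": 3}


def preprocess(text: str, max_len: int = 4) -> List[int]:
    """Turn raw text into a fixed-length list of token ids (the model input)."""
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in text.lower().split()]
    ids = ids[:max_len]
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))


def fake_model(input_ids: List[int]) -> List[float]:
    """Stand-in for the real network: returns two unnormalized scores."""
    known = sum(1 for i in input_ids if i > 1)
    return [float(known), float(len(input_ids) - known)]


def postprocess(scores: List[float]) -> Dict[str, float]:
    """Turn raw model output into the response format the caller expects."""
    total = sum(scores) or 1.0
    return {"positive": scores[0] / total, "negative": scores[1] / total}


input_ids = preprocess("Good movie !")
result = postprocess(fake_model(input_ids))
print(input_ids, result)
```

If `preprocess` and `postprocess` are written once for training and rewritten by hand for serving, any small difference between the two versions shows up as a diff in the final result.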

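Our goal is to provide an automated and unified training and inference solution for the end-to-end process above, to reduce problems such as hand-writing the inference pipeline and aligning diffs, and to arrive at a unified solution for large-scale deployment.

2. Core Problem

Frameworks such as PyTorch and TensorFlow have largely solved the unification of training and inference for the model itself, so model computation no longer has a train/serve consistency problem (operator performance optimization is outside the scope of this article).

The core problem to solve is: preprocessing and postprocessing need a high-performance solution that is shared between training and inference.

For this kind of logic, TensorFlow 2.x provides tf.function (still incomplete) and PyTorch provides TorchScript. Both, without exception, chose a subset of native Python syntax. Powerful as they are, they still have problems that cannot be ignored: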

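Performance: these solutions are mostly implemented on a virtual machine. A VM is flexible and very controllable, but the VMs in deep-learning frameworks generally do not perform well. To explain: the frameworks were originally designed for Tensor computation, where each operator on an array is expensive, so the VM's dispatch and scheduling overhead can be ignored. Once the same VM is used for general-purpose programming, that overhead is hard to ignore, and it becomes a performance bottleneck as the amount of code grows. According to our tests, the TorchScript interpreter runs at only about 1/5 the speed of Python, and tf.function is slower still.

Incomplete functionality: in real scenarios we can still find many important features that tf.function/TorchScript do not support. For example: custom resources cannot be packaged, and only built-in types can be serialized; strings can only be handled as bytes, so Chinese and other unicode text causes diffs; containers must be homogeneous; custom types are not supported; and so on.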
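The point about strings being handled as bytes can be reproduced with standard Python alone. The sketch below uses an arbitrary sample string and shows how length and truncation differ between the unicode view and the UTF-8 byte view, which is one way such a diff can arise.

```python
text = "字节跳动 MATX"
raw = text.encode("utf-8")

# The same text has different lengths as characters and as bytes.
char_len = len(text)
byte_len = len(raw)

# "Keep the first 4 units" means different things in the two views:
# slicing bytes can cut a multi-byte character in half.
by_chars = text[:4]
by_bytes = raw[:4].decode("utf-8", errors="replace")

print(char_len, byte_len)
print(repr(by_chars), repr(by_bytes))
```

A preprocessing step that truncates by characters in Python during training but by bytes in the serving engine would feed different inputs to the same model.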

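Furthermore, there are many non-deep-learning tasks. In natural language processing, for example, there are still many non-deep-learning applications or subtasks, such as sequence labeling, language model decoding, and hand-crafted feature construction for tree models. These usually have more flexible feature paradigms, but none of them has a complete end-to-end solution shared between training and inference, so a great deal of development and correctness verification work remains.

To solve the problems above, we developed a compilation-based preprocessing solution: MATXScript!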

3. MATXScript

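In deep-learning algorithm development, developers usually use Python for fast iteration and experiments, and C++ for high-performance online services. Correctness verification and service development both become a heavy burden!

MatxScript (https://github.com/bytedance/matxscript) is an AOT compiler for a Python sub-language. It can automatically translate Python into C++ and provides one-click packaging and release. With MATXScript, developers can iterate on models quickly while deploying high-performance services at a relatively low cost.

The core architecture is as follows:

The lowest layer is a pure C++/CUDA base library developed by high-performance operator experts. On top of the base library, Python libraries are wrapped following agreed conventions and can be used during training. When inference is needed, MATXScript translates the Python code into equivalent C++ code and compiles it into a dynamic link library, which is packaged and released together with the model and other dependent resources.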
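To give a feel for why a type-annotated Python subset lends itself to ahead-of-time translation, here is a toy sketch built on the standard `ast` module (Python 3.9+ for `ast.unparse`). It is not MATXScript's compiler: the `CPP_TYPES` map and the `cpp_signature` helper are hypothetical, and the sketch only emits a C++ declaration for a single function signature.

```python
import ast

# Hypothetical, tiny type map; a real compiler covers far more types.
CPP_TYPES = {"int": "int64_t", "float": "double", "bool": "bool", "str": "std::string"}


def cpp_signature(source: str) -> str:
    """Emit a C++ declaration for the first function found in `source`."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = []
    for arg in func.args.args:
        # Without an annotation there is no static type to translate to.
        if arg.annotation is None:
            raise TypeError(f"parameter '{arg.arg}' needs a type annotation")
        params.append(f"{CPP_TYPES[ast.unparse(arg.annotation)]} {arg.arg}")
    if func.returns is None:
        raise TypeError("a return type annotation is required")
    ret = CPP_TYPES[ast.unparse(func.returns)]
    return f"{ret} {func.name}({', '.join(params)});"


decl = cpp_signature("def lookup(word: str, fallback: int) -> int:\n    return fallback\n")
print(decl)
```

The sketch rejects unannotated parameters, which hints at why the examples later in this article annotate their arguments and return values.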

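The compiler plays a critical role here. Its core flow is as follows:

Through the flow above, the preprocessing code written by the user is compiled into a JitOp in the Pipeline. To connect pre/postprocessing with the model, we also developed a tracing system (its interface design references PyTorch). The architecture is as follows:

With MATXScript, training and inference can use the same code, which greatly reduces the cost of model deployment. At the same time, architecture and algorithm work are decoupled: algorithm engineers can work entirely in Python, while infrastructure engineers focus on compiler development and runtime optimization. At ByteDance, this solution has been validated by large-scale deployment!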
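The idea of tracing, recording which ops are called on placeholder inputs and replaying that record later on real data, can be illustrated with a toy tracer. This is only a conceptual sketch: the `Symbol` and `Tracer` classes are hypothetical and do not reflect MATXScript's internal design.

```python
from typing import Any, Callable, Dict, List, Tuple


class Symbol:
    """Placeholder that stands for a value flowing through the pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name


class Tracer:
    def __init__(self) -> None:
        # Each step: (output name, function, input names)
        self.steps: List[Tuple[str, Callable[..., Any], List[str]]] = []

    def op(self, fn: Callable[..., Any]) -> Callable[..., Symbol]:
        """Wrap fn so that calls on Symbols are recorded, not executed."""
        def wrapper(*args: Symbol) -> Symbol:
            out = Symbol(f"v{len(self.steps)}")
            self.steps.append((out.name, fn, [a.name for a in args]))
            return out
        return wrapper

    def run(self, feed: Dict[str, Any], output: Symbol) -> Any:
        """Replay the recorded steps on real inputs."""
        env = dict(feed)
        for out_name, fn, arg_names in self.steps:
            env[out_name] = fn(*[env[n] for n in arg_names])
        return env[output.name]


tracer = Tracer()
split = tracer.op(lambda text: text.split())
count = tracer.op(lambda words: len(words))

out = count(split(Symbol("texts")))  # records two steps, computes nothing yet
print(tracer.run({"texts": "hello world"}, out))
```

Because the recorded steps form a plain data structure, a record of this kind could in principle be serialized and replayed by a different runtime, which is the role the exported package plays in the sections below.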

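4. A Quick Try

Here we take the simplest English text preprocessing as an example to show how MATXScript is used.

Goal: convert a piece of English text into indexes

1. Write a basic dictionary lookup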

    from typing import Dict, List


    class Text2Ids:
        def __init__(self) -> None:
            # word -> index table; "[UNK]" is the fallback entry
            self.table: Dict[str, int] = {
                "hello": 0,
                "world": 1,
                "[UNK]": 2,
            }

        def lookup(self, word: str) -> int:
            # unknown words map to the index of "[UNK]"
            return self.table.get(word, 2)

        def __call__(self, words: List[str]) -> List[int]:
            return [self.lookup(w) for w in words]
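Since the class above is ordinary Python, its logic can be checked before any compilation step. The sketch below repeats the class so that it runs on its own and assumes the text is split on whitespace before the lookup; that splitting step is an assumption of this sketch, not something the class does itself.

```python
from typing import Dict, List


class Text2Ids:
    def __init__(self) -> None:
        self.table: Dict[str, int] = {"hello": 0, "world": 1, "[UNK]": 2}

    def lookup(self, word: str) -> int:
        return self.table.get(word, 2)

    def __call__(self, words: List[str]) -> List[int]:
        return [self.lookup(w) for w in words]


text2ids = Text2Ids()
words = "hello world unknown".split()  # assumed tokenization: whitespace split
ids = text2ids(words)
print(ids)  # [0, 1, 2]
```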

2. Write the Pipeline

    import matx


    class WorkFlow:
        def __init__(self):
            # Compilation happens here: the Python code is automatically
            # compiled and wrapped as a Callable object
            self.text2ids = matx.script(Text2Ids)()

        def process(self, texts):
            ids = self.text2ids(texts)
            return ids


    # test
    handler = WorkFlow()
    print(handler.process("hello world unknown"))
    # output: [0, 1, 2]

3. Trace and export to disk

    # dump
    mod = matx.trace(handler.process, "hello world")
    print(mod.run({"texts": "hello world"}))
    mod.save('./my_dir')

    # load
    mod = matx.load('./my_dir', -1)
    print(mod.run({"texts": "hello world"}))

4. Load from C++

    #include <string>
    #include <vector>
    #include <map>
    #include <unordered_map>
    #include <iostream>
    #include <matxscript/pipeline/tx_session.h>

    using namespace ::matxscript::runtime;

    int main() {
      // test case
      std::unordered_map<std::string, RTValue> feed_dict;
      feed_dict.emplace("texts", Unicode(U"hello world"));
      std::vector<std::pair<std::string, RTValue>> result;
      const char* module_path = "./my_dir";
      const char* module_name = "model.spec.json";
      {
        // -1 means CPU
        auto sess = TXSession::Load(module_path, module_name, -1);
        auto result = sess->Run(feed_dict);
        for (auto& r : result) {
          std::cout << "key: " << r.first << ", value: " << r.second << std::endl;
        }
      }
      return 0;
    }

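The complete code is at: https://github.com/bytedance/matxscript/tree/main/examples/text2ids

Summary: the above is a very simple preprocessing routine implemented in pure Python that can be loaded and run by a piece of generic C++ code. Next, we combine it with a model to show a practical multimodal end-to-end case!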

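5. Multimodal Case

Here we use image-text multimodality (Bert+Resnet) as an example, with the model written in PyTorch, to show the actual work involved in training and deployment.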

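1. Set up the environment
a. Configure basic infrastructure such as gcc/cuda (usually already handled by the operations team)
b. Install MATXScript and the base libraries built on it (text, vision, etc.)

2. Write the model code
a. Omitted here; readers can refer to the paper or other open-source implementations

3. Write the preprocessing code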

a. text

    from typing import List, Dict, Tuple
    import libcut
    import matx


    class Vocabulary:
        ...


    def utf8_decoder(s: List[bytes]):
        return [x.decode() for x in s]


    class TextNDArrayBuilder:
        ...


    class TextPipeline:
        def __init__(self, mode: str = "eval"):
            self.mode = mode
            self.cut_engine = libcut.Cutter('/path/to/cut_models', ...)
            self.vocab = matx.script(Vocabulary)('/path/to/vocab.txt')
            self.decoder = matx.script(utf8_decoder)
            self.input_builder = matx.script(TextNDArrayBuilder)(self.vocab)

        def process(self, text: List[bytes]):
            # List[bytes] matches C++ vector<string>
            text: List[str] = self.decoder(text)
            words: List[List[str]] = self.cut_engine(text)
            batch_ids: List[List[int]] = self.vocab(words)
            input_ids, segment_ids, mask_ids = self.input_builder(batch_ids, 32)
            if self.mode == "train":
                return input_ids.torch(), segment_ids.torch(), mask_ids.torch()
            return input_ids, segment_ids, mask_ids
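The body of `TextNDArrayBuilder` is elided in the listing above. As a rough idea of what a builder producing `input_ids`, `segment_ids`, and `mask_ids` typically does, here is a sketch in plain Python lists. The function name `build_inputs`, the pad id of 0, and the single-segment layout are assumptions of this sketch, and special tokens such as [CLS]/[SEP] are left out.

```python
from typing import List, Tuple

Batch = List[List[int]]


def build_inputs(batch_ids: Batch, max_len: int, pad_id: int = 0) -> Tuple[Batch, Batch, Batch]:
    """Truncate or pad each sequence to max_len and build the matching masks."""
    input_ids: Batch = []
    segment_ids: Batch = []
    mask_ids: Batch = []
    for ids in batch_ids:
        ids = ids[:max_len]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        segment_ids.append([0] * max_len)            # single-segment input
        mask_ids.append([1] * len(ids) + [0] * pad)  # 1 = real token, 0 = padding
    return input_ids, segment_ids, mask_ids


input_ids, segment_ids, mask_ids = build_inputs([[5, 6, 7], [8]], max_len=4)
print(input_ids)
print(mask_ids)
```

Fixing the length (32 in the pipeline above) gives every sample in a batch the same shape, so the batch can be turned into one rectangular array.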

b. vision

    from typing import List, Dict, Tuple
    import matx
    from matx import vision


    class VisionPipeline:
        def __init__(self,
                     device_id: int = 0,
                     mode: str = "eval",
                     image_size: int = 224,):
            self.is_training = mode == 'train'
            self.mode = mode
            ...

        def process(self, image,):
            if self.is_training:
                decode_nds = self.random_crop_decode(image)
                flip_nds = self.random_flip(decode_nds)
                resize_nds = self.resize(flip_nds)
                transpose_nd = self.transpose_norm(resize_nds, vision.SYNC)
            else:
                decode_nds = self.decode(image)
                resize_nds = self.resize(decode_nds)
                crop_nds = self.center_crop(resize_nds)
                transpose_nd = self.transpose_norm(crop_nds, vision.SYNC)
            if self.mode == "trace":
                return transpose_nd
            return transpose_nd.torch()
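The eval branch above resizes, center-crops, and then transposes and normalizes the image. The arithmetic behind the last two steps can be sketched on nested Python lists. The helper names, the 256-to-224 crop, and the mean/std of 0.5 are illustrative assumptions, not values taken from the pipeline.

```python
from typing import List, Tuple

Image = List[List[List[float]]]


def center_crop_box(height: int, width: int, size: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a size x size crop at the center."""
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size


def hwc_to_chw_norm(image: Image, mean: List[float], std: List[float]) -> Image:
    """Scale pixels to [0, 1], normalize per channel, and reorder HWC -> CHW."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    return [
        [[(image[y][x][ch] / 255.0 - mean[ch]) / std[ch] for x in range(w)]
         for y in range(h)]
        for ch in range(c)
    ]


box = center_crop_box(256, 256, 224)
pixel = [[[255.0, 0.0, 127.5]]]  # a 1x1 image with 3 channels
chw = hwc_to_chw_norm(pixel, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(box, chw)
```

The channel-first layout is the one PyTorch vision models usually expect, which is why the transpose is fused with normalization as the final preprocessing step.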

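4. Connect to the DataLoader
a. TextPipeline can be plugged into a Dataset like an ordinary Python class
b. VisionPipeline involves GPU preprocessing and is better suited to processing by batch, so it needs its own separately constructed DataLoader (a teaser: ByteDance's internal multithreaded DataLoader will be open-sourced later)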
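Batch-wise preprocessing with worker threads can be sketched with the standard library alone. This is not the internal ByteDance DataLoader mentioned above: `batched_loader` is a hypothetical helper, and `process_batch` stands in for something like a batch-level vision pipeline call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List, Sequence


def batched_loader(samples: Sequence[bytes],
                   batch_size: int,
                   process_batch: Callable[[List[bytes]], Any],
                   num_workers: int = 2) -> Iterator[Any]:
    """Split samples into batches and preprocess the batches in worker threads."""
    batches = [list(samples[i:i + batch_size])
               for i in range(0, len(samples), batch_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in input order while batches run concurrently.
        yield from pool.map(process_batch, batches)


fake_images = [bytes([i]) for i in range(5)]  # stand-ins for encoded images
outputs = list(batched_loader(fake_images, 2, lambda batch: len(batch)))
print(outputs)
```

Handing a whole batch to one call lets a GPU preprocessing op work on many images at once instead of being invoked per sample.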

5. Add the model code and start training

6. Export the end-to-end Inference Model

    from typing import List
    import matx
    import torch


    class MultimodalEvalPipeline:
        def __init__(self):
            self.text_pipe = TextPipeline(mode="eval", ...)
            self.vision_pipe = VisionPipeline(mode="eval", ...)
            self.torch_model = torch.jit.load('/path/to/multimodal.jit', map_location='cuda:0')
            self.tx_model_op = matx.script(self.torch_model, device=0)

        def eval(self, texts: List[bytes], images: List[bytes]) -> List[float]:
            input_ids, segment_ids, mask_ids = self.text_pipe.process(texts)
            images = self.vision_pipe.process(images)
            scores = self.tx_model_op(input_ids, segment_ids, mask_ids, images)
            return scores


    # examples
    example_batch_size = 8
    text_examples = ['hello, world'.encode()] * example_batch_size
    with open('/path/image.jpg', 'rb') as f:
        image_example = f.read()
    image_examples = [image_example] * example_batch_size

    # pipeline instance
    pipe = MultimodalEvalPipeline(...)
    mod = matx.trace(pipe.eval, text_examples, image_examples)

    # test
    print(mod.run({"texts": text_examples, "images": image_examples}))

    # save
    mod.save('/path/to/my_multimodal')

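Summary: after the steps above, the end-to-end training and release work is complete. The whole process is done in pure Python and can be fully controlled by algorithm engineers. Of course, if the model computation itself still has performance problems, these can be addressed behind the scenes through automatic graph rewriting and optimization.

Note: the complete code example is at https://github.com/bytedance/matxscript/tree/main/examples/e2e_multi_modal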

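6. Unified Server

In the previous section we obtained a model package released by an algorithm engineer. This section describes how to load and run it with a unified service.

A complete Server includes: the IDL protocol, the batching strategy, process/thread scheduling and placement, model inference...

Here we only discuss model inference; the rest can be developed by convention. We use a main function to illustrate loading and running the model: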

    #include <string>
    #include <vector>
    #include <map>
    #include <unordered_map>
    #include <iostream>
    #include <matxscript/pipeline/tx_session.h>

    using namespace ::matxscript::runtime;

    int main() {
      // test case
      std::unordered_map<std::string, RTValue> feed_dict;
      feed_dict.emplace("texts", List({String("hello world")}));
      feed_dict.emplace("images", List({String("......")}));
      std::vector<std::pair<std::string, RTValue>> result;
      const char* module_path = "/path/to/my_multimodal";
      const char* module_name = "model.spec.json";
      {
        // cuda:0
        auto sess = TXSession::Load(module_path, module_name, 0);
        auto result = sess->Run(feed_dict);
        for (auto& r : result) {
          std::cout << "key: " << r.first << ", value: " << r.second << std::endl;
        }
      }
      return 0;
    }

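The code above is the simplest case of loading a multimodal model from C++. For Server developers, a little abstraction and a few conventions are enough to turn it into a unified C++ model serving framework.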
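The kind of abstraction meant here can be sketched in a few lines. The sketch is in Python for brevity, whereas the serving framework described in this article is C++. `ModelServer`, the `Session` alias, and `fake_session` are hypothetical names; the only convention borrowed from the listing above is that a session takes a feed dict and returns (name, value) pairs.

```python
from typing import Any, Callable, Dict, List, Tuple

# A session maps a feed dict to (name, value) pairs, mirroring the C++ example.
Session = Callable[[Dict[str, Any]], List[Tuple[str, Any]]]


class ModelServer:
    """Holds named sessions and routes each request to one of them."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def register(self, name: str, session: Session) -> None:
        self._sessions[name] = session

    def predict(self, name: str, feed_dict: Dict[str, Any]) -> Dict[str, Any]:
        if name not in self._sessions:
            raise KeyError(f"model '{name}' is not loaded")
        return dict(self._sessions[name](feed_dict))


def fake_session(feed_dict: Dict[str, Any]) -> List[Tuple[str, Any]]:
    # Stand-in for a loaded model package: one score per input text.
    return [("scores", [float(len(t)) for t in feed_dict["texts"]])]


server = ModelServer()
server.register("multimodal", fake_session)
response = server.predict("multimodal", {"texts": ["hello world"], "images": [b"..."]})
print(response)
```

Because every model package is loaded and run through the same interface, the server code does not need to change when a new model is released.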

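7. More Information

We are the ByteDance AML machine learning systems team. We are committed to providing the company with a unified, high-performance framework that integrates training and inference, and we also serve partner enterprises through the Volcano Engine machine learning platform. The platform is expected to provide MATX-related support starting in 2023, including preset image environments, public examples for common scenarios, and technical support during enterprise onboarding and use, to achieve low-cost acceleration and integration for training and inference scenarios. You are welcome to learn more about our product at https://www.volcengine.com/product/ml-platform.

For more information, visit GitHub: https://github.com/bytedance/matxscript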