OnnxRuntime

发表于 2022-11-13 更新于 2022-12-18 分类于 blog 阅读数：评论数：本文字数： 20k 阅读时长 ≈ 33 分钟

关于OnnxRuntime推理框架的介绍到框架的源码剖析以及项目中使用onnxruntime部署遇到的问题总结

OnnxRuntime 简介

ONNX自带了 Runtime 库，能够将 ONNX IR 部署到不同的硬件设备上进行推理，支持各种后端（如 CUDAExecutionProvider/CPUExecutionProdiver/TvmExecutionProvider）。

基于 ONNX Model的 Runtime 系统架构¹如下，可以看到 Runtime 实现功能是将 ONNX Model 转换为计算图，之后通过将其转化为各个可执行的子图，最后通过 GetCapability() API将子图分配到不同的后端（execution provider, EP）执行。

onnxruntime_1

已支持的Execution Providers

ONNXRuntime 现在已经支持多种不同的 EP（Execution Provider）。一些 EP 已经在上线并应用到实际生产中，还有一些 EP 则以预览形式发布。开发人员可以通过配置选项选择不同的 EP 进行推理。

例如，我们可以选择用 CUDAExecutionProvider 推理，实现如下：

import onnxruntime as ort
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    })
]
session = ort.InferenceSession(model_path, providers=providers)

当前已经支持/正在支持的 EP 总结如下：

CPU	GPU	IoT/Edge/Mobile	Other
Default CPU	NVIDIA CUDA	Intel OpenVINO	Rockchip NPU (preview)
Intel DNNL	NVIDIA TensorRT	ARM Compute Library (preview)	Xilinx Vitis-AI (preview)
TVM (preview)	DirectML	Android Neural Networks API	Huawei CANN (preview)
Intel OpenVINO	AMD MIGraphX (preview)	ARM-NN (preview)
	AMD ROCm (preview)	CoreML (preview)
	TVM (preview)	TVM (preview)
	Intel OpenVINO	Qualcomm SNPE
XNNPACK		XNNPACK

我们实际项目中常用的是 CPUExecutionProvider 和 CUDAExecutionProvider。当然，ONNXRuntime 也是支持自定义 EP的，如果项目需要自定义 EP，可以参考如何添加一个EP。

Runtime源码剖析

onnx中如何绑定C与python接口的

onnx_runtime/onnxruntime/python/onnxruntime_pybind_state.h中声明了三个接口，这个三个接口实现C与python接口的绑定

#include "onnxruntime_pybind.h"  // must use this for the include of <pybind11/pybind11.h>

namespace onnxruntime {
namespace python {

void addGlobalMethods(py::module& m, Environment& env);
void addObjectMethods(py::module& m, Environment& env);
void addOrtValueMethods(pybind11::module& m);

}  // namespace python
}  // namespace onnxruntime


// onnxruntime/onnxruntime/python/onnxruntime_pybind_module.cc中通过

#include "onnxruntime_pybind.h"  // must use this for the include of <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "core/providers/get_execution_providers.h"

namespace onnxruntime {
namespace python {
namespace py = pybind11;

void CreateInferencePybindStateModule(py::module& m);

PYBIND11_MODULE(onnxruntime_pybind11_state, m) {
  CreateInferencePybindStateModule(m);
  // move it out of shared method since training build has a little different behavior.
  m.def(
      "get_available_providers", []() -> const std::vector<std::string>& { return GetAvailableExecutionProviderNames(); },
      "Return list of available Execution Providers in this installed version of Onnxruntime. "
      "The order of elements represents the default priority order of Execution Providers "
      "from highest to lowest.");
}
}  // namespace python
}  // namespace onnxruntime

InferenceSession中调用构造函数，但是没有这个类型的构造函数

return onnxruntime::make_unique<InferenceSession>(so, arg, SessionObjectInitializer::Get());

  /**
    Create a new InferenceSession
    @param session_options Session options.
    @param model_uri absolute path of the model file.
    @param logging_manager
    Optional logging manager instance that will enable per session logger output using
    session_options.session_logid as the logger id in messages.
    If nullptr, the default LoggingManager MUST have been created previously as it will be used
    for logging. This will use the default logger id in messages.
    See core/common/logging/logging.h for details, and how LoggingManager::DefaultLogger works.
    This ctor will throw on encountering model parsing issues.
    */
  InferenceSession(const SessionOptions& session_options,
                   const std::string& model_uri,
                   logging::LoggingManager* logging_manager = nullptr);

SessionObjectInitializer::Get()返回一个SessionObjectInitializer类型的对象，这个对象不得了，开发者给他订一了两个类型转换函数，

SessionObjectInitializer -> const SesssionOptions& Args1
SessionObjectInitializer->logging::LoggingManager*

这样编译器就可以利用这两个类型转换函数来实现类型的隐式转换而没有报类型错误。

class SessionObjectInitializer {
 public:
  typedef const SessionOptions& Arg1;
  typedef logging::LoggingManager* Arg2;
  operator Arg1() {
    return GetDefaultCPUSessionOptions();
  }

  operator Arg2() {
    static std::string default_logger_id{"Default"};
    static LoggingManager default_logging_manager{std::unique_ptr<ISink>{new CErrSink{}},
                                                  Severity::kWARNING, false, LoggingManager::InstanceType::Default,
                                                  &default_logger_id};
    return &default_logging_manager;
  }

  static SessionObjectInitializer Get() {
    return SessionObjectInitializer();
  }
};

终于找到真身了‼️

model_path初始化
model_proto初始化
构建session

InferenceSession::InferenceSession(const SessionOptions& session_options,
                                   const std::string& model_uri,
                                   logging::LoggingManager* logging_manager)
    : insert_cast_transformer_("CastFloat16Transformer") {
  model_location_ = ToWideString(model_uri);
  model_proto_ = onnxruntime::make_unique<ONNX_NAMESPACE::ModelProto>();
  auto status = Model::Load(model_location_, *model_proto_);
  ORT_ENFORCE(status.IsOK(), "Given model could not be parsed while creating inference session. Error message: ",
              status.ErrorMessage());

  // Finalize session options and initialize assets of this session instance
  ConstructorCommon(session_options, logging_manager);
}

`Model::Load`加载模型的流程

不支持的算子如何处理
多个provider之间怎么分配运行图

性能优化

图优化技术

随着 ONNX 技术逐渐成为通用 IR，相关的优化工具层出不穷。其中为了部署性能，转换工具或者其他优化工具基本集成基本的图优化策略²，包括: 冗余节点消除，常量折叠和算子融合等等，本小节将针对这些相关优化策略进行总结。

常量折叠（Constant Folding）：静态计算，对仅依赖常量初始值的图结构进行预计算并优化
冗余节点消除（Redundant node eliminations）：在不改变图结构的情况下移除所有冗余节点
- Identity Elimination
- Slice Elimination
- Unsqueeze Elimination
- Dropout Elimination
算子融合（Semantics-preserving node fusions）：对相邻的的算子进行融合，减少 Kernel 加载的性能开销。主要针对常见的一些模式进行算子替换
- Conv Add Fusion
- Conv Mul Fusion
- Conv BatchNorm Fusion
- Relu Clip Fusion
- Reshape Fusion
布局优化（Layout Optimization）：改变适用节点的数据布局，以实现更高的性能改进。它们在图形分区之后运行，并且仅支持 CPUExecutionProvider
- NCHWc 优化器：通过使用 NCHWc 布局而不是 NCHW 布局来优化图形

拓展操作（Extended Graph Optimizations）：这类优化主要针对复杂、但常用的操作进行单独算子拓展实现，尤其是针对当下最流行的 Backbone/Embedding 等，这类算子对提升行业应用收效显著。

Optimization	Execution Provider	Comment
GEMM Activation Fusion	CPU
Matmul Add Fusion	CPU
Conv Activation Fusion	CPU
GELU Fusion	CPU, CUDA, ROCm
Layer Normalization Fusion	CPU, CUDA, ROCm	主要在 BERT 类序列任务中应用
BERT Embedding Layer Fusion	CPU, CUDA, ROCm	融合BERT embedding layer，layer normalization 和 attention mask length
Attention Fusion*	CPU, CUDA, ROCm	为了优化 BERT 的性能，在 CUDA 和 ROCm 执行提供程序的 GELU 近似和注意融合中使用了近似。
Skip Layer Normalization Fusion	CPU, CUDA, ROCm	融合带偏置的全连接层、skip connection和layer normalization
Bias GELU Fusion	CPU, CUDA, ROCm	融合带偏置的全连接和GELU激活层
GELU Approximation*	CUDA, ROCm	为了优化 BERT 的性能，在 CUDA 和 ROCm 执行提供程序的 GELU 近似和注意融合中使用了近似。Disabled by default. Enable with kOrtSessionOptionsEnableGeluApproximation

ONNXRuntime 定义了 GraphOptimizationLevel 枚举类型以确定将启用上述哪些优化级别。选择一个级别可以启用该级别的优化，以及所有先前级别的优化。例如，启用扩展优化，也会启用基本优化。这些级别到枚举的映射如下：

GraphOptimizationLevel::ORT_DISABLE_ALL: 关掉所有的图优化策略
GraphOptimizationLevel::ORT_ENABLE_BASIC: 打开基本优化
GraphOptimizationLevel::ORT_ENABLE_EXTENDED: 打开基本优化和拓展优化
GraphOptimizationLevel::ORT_ENABLE_ALL: 打开所有优化，包括基本优化/拓展优化/Layout优化

性能瓶颈优化

使用 CUDAExecutionProvider 时，在进行模型推理（调用 InferenceSession.Run()）之前，如果输入/输出直接在目标设备（Device）上进行构造能达到性能更佳。当输入未拷贝到目标设备时，ORT 会将其从 CPU 复制到 Device，这也是在 InferenceSession.Run() 中执行。类似地，如果输出未在设备上预先构造分配，默认情况下，ORT 直接结果输出到 CPU。尤其是当业务 Pipeline 中是多个模型时候，大部分时间花在这些数据传输上时，很明显这会占用大量的执行时间，让人以为是 ORT 性能很糟糕。例如：

import onnxruntime as ort
import numpy as np
import torch

providers = ["CUDAExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)
#  X is from other model output which is pytorch.Tensor
X = torch.empty(shape=(3, 224, 224), dtype=torch.float32, device=torch.device("cuda:0"))
io = {
    "X": X.detach().numpy().astype(np.float32)
}
Y = session.run(io)[0]

为了解决这个问题，ONNXRuntime 中使用 IOBinding 技术来提升性能。该技术主要思想是实现在调用 InferenceSession.Run() 之前将输入/输出在目标设备上进行分配构造，减少不必要的 IO 开销。以下是将在 GPU 上的 Tensor 直接绑定到 ORT 后的代码实现。

import onnxruntime as ort
import torch
import numpy as np

providers = ["CUDAExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)
device_id = torch.cuda.current_device()
X = torch.empty(shape=(3, 224, 224), dtype=torch.float32, device=torch.device("cuda", device_id))
torch_y = torch.empty(
    shape=(5, 1),
    dtype=torch.float32,
    device=torch.device("cuda", device_id),
)
binding = session.io_binding()
# io = { "X": X.detach().numpy().astype(np.float32)}
binding.bind_input(
    name='X',
    device_type='cuda',
    device_id=device_id,
    element_type=np.int64,
    shape=tuple(X.shape),
    buffer_ptr=X.data_ptr(),
)
binding.bind_output(
    name='Y',
    device_type='cuda',
    device_id=device_id,
    element_type=np.int64,
    shape=tuple(torch_y.shape),
    buffer_ptr=torch_y.data_ptr(),
)
# Y = session.run(io)[0]
# Y = torch.from_numpy(Y).to(dtype=torch.float).cuda()
session.run_with_iobinding(binding)

模型部署调优

自带的转换和性能调优工具 OLive，采用启发式规则对搜索空间内参数进行搜索，获得特定硬件场景下的最佳性能组合，可以输出最少延迟或者最大吞吐性能的参数配置组合。

Execution Providers
- MLAS(default CPU EP), Intel DNNL and OpenVino for CPU
- NVIDIA CUDA and TensorRT for GPU
Environment Variables
- OMP_WAIT_POLICY
- OMP_NUM_THREADS
- KMP_AFFINITY
- OMP_MAX_ACTIVE_LEVELS
Session Options
- inter_op_num_threads
- intra_op_num_threads
- execution_mode
- graph_optimization_level

举一个例子

import onnxruntime
import numpy as np

model_path = "efficientnet_b3.onnx"
output_names = ['output']
dtype = np.float32
inputs = np.random.rand(1, 3, 224, 224).astype(dtype)
providers = [
    ('TensorrtExecutionProvider', {
        'device_id': 0,
        'trt_max_workspace_size': 2147483648,
        'trt_fp16_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    }),
]
options = onnxruntime.SessionOptions()
options.log_severity_level = 3 # Applies to session load, initialization, etc. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2.
# options.log_verbosity_level =0 # VLOG level if DEBUG build and session_log_severity_level is 0. Applies to session load, initialization, etc. Default is 0.
# options.enable_cpu_mem_arena = True
# options.enable_mem_pattern = True
# options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_EXTENDED   # Graph optimization level for this session.
options.inter_op_num_threads = 1 # Sets the number of threads used to parallelize the execution of the graph (across nodes). Default is 0 to let onnxruntime choose.
options.intra_op_num_threads = 1 # Sets the number of threads used to parallelize the execution within nodes. Default is 0 to let onnxruntime choose.
model = onnxruntime.InferenceSession(model_path, providers=providers, sess_options=options)
ort_inputs = {"input": inputs}
ort_outputs = [out.name for out in model.get_outputs()]
results = model.run(ort_outputs, ort_inputs)[0]

常见问题

01、`clang: error: argument unused during compilation: '-mfpu=neon' [-Werror,-Wunused-command-line-argument]`

这个应该是由于onnxruntime对MAC M1机型的兼容性的一个bug，当前我先临时将-mfpu=neon从CMAKE_CXX_FLAGS中移除，然后重新编译。

[ 20%] Building CXX object CMakeFiles/onnxruntime_common.dir/Users/xxx/Documents/Framework/onnxruntime/onnxruntime/core/common/cpuid_info.cc.o
clang: error: argument unused during compilation: '-mfpu=neon' [-Werror,-Wunused-command-line-argument]
make[2]: *** [CMakeFiles/onnxruntime_common.dir/Users/xxx/Documents/Framework/onnxruntime/onnxruntime/core/common/cpuid_info.cc.o] Error 1
make[1]: *** [CMakeFiles/onnxruntime_common.dir/all] Error 2
make: *** [all] Error 2
Traceback (most recent call last):
  File "/Users/xxx/Documents/Framework/onnxruntime/tools/ci_build/build.py", line 1065, in <module>
    sys.exit(main())
  File "/Users/xxx/Documents/Framework/onnxruntime/tools/ci_build/build.py", line 1002, in main
    build_targets(args, cmake_path, build_dir, configs, args.parallel)
  File "/Users/xxx/Documents/Framework/onnxruntime/tools/ci_build/build.py", line 471, in build_targets
    run_subprocess(cmd_args, env=env)
  File "/Users/xxx/Documents/Framework/onnxruntime/tools/ci_build/build.py", line 212, in run_subprocess
    completed_process = subprocess.run(args, cwd=cwd, check=True, stdout=stdout, stderr=stderr, env=my_env, shell=shell)
  File "/Users/xxx/mambaforge/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/homebrew/Cellar/cmake/3.23.2/bin/cmake', '--build', '/Users/xxx/Documents/Framework/onnxruntime/build/Linux/Debug', '--config', 'Debug']' returned non-zero exit status 2.

/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA A100 80GB PCIe with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100 80GB PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "tools/export_onnx_models.py", line 117, in <module>
    convert_onnx(args.model_name, args.model_path, args.batch_size, export_fp16=args.fp16, verbose=args.verbose)
  File "tools/export_onnx_models.py", line 69, in convert_onnx
    inputs = torch.rand(batch_size, 3, 224, 224, dtype=dtype, device=0)
RuntimeError: CUDA error: no kernel image is available for execution on the device

02、load model is DataParallel format, your should notice:

the op name with 'module.' which will result some operator failed, such as load_state_dict will throw miss match

#
# load model weight and use another model weight update the network weight
# 
model = EfficientNet.from_pretrained('efficientnet-b4', num_classes=5)
model.set_swish(memory_efficient=False)
dataparallel_model = torch.load(model_path, map_location="cpu")
from collections import OrderedDict
new_state_dict = OrderedDict()
# method 1: use module state_dict to update weight
for k in dataparallel_model.module.state_dict():
    new_state_dict[k] = dataparallel_model.module.state_dict()[k]

# method 2: current dataparallel_model weight is module._xxxname 
for k in dataparallel_model.state_dict():
    new_state_dict[k[7:]] = dataparallel_model.state_dict()[k]

model.load_state_dict(new_state_dict)
model.cuda()
torch.onnx.export(model, inputs, output_fn, verbose=verbose)

03、 Some operator not supported by ONNX

WARNING: The shape inference of prim::PythonOp type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Traceback (most recent call last):
  File "export_onnx_efficient_cls.py", line 79, in <module>
    convert_onnx("efficient_b4_big_5cls", args.model_path, args.batch_size)
  File "export_onnx_efficient_cls.py", line 55, in convert_onnx
    torch.onnx.export(model.module, inputs, output_fn, verbose=verbose)
  File "/home/xxxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/onnx/__init__.py", line 350, in export
    return utils.export(
  File "/home/xxxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/onnx/utils.py", line 163, in export
    _export(
  File "/home/xxxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/onnx/utils.py", line 1110, in _export
    ) = graph._export_onnx(  # type: ignore[attr-defined]
RuntimeError: ONNX export failed: Couldn't export Python operator SwishImplementation

04、获取onnx模型的输出

# get onnx output
input_all = [node.name for node in onnx_model.graph.input]
input_initializer = [
    node.name for node in onnx_model.graph.initializer
]
net_feed_input = list(set(input_all) - set(input_initializer))
assert (len(net_feed_input) == 1)

05、TypeError: Descriptors cannot not be created directly.

Traceback (most recent call last):
  File "export_onnx_models.py", line 4, in <module>
    import onnx
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/onnx/__init__.py", line 6, in <module>
    from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors, convert_model_to_external_data
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/onnx/external_data_helper.py", line 9, in <module>
    from .onnx_pb import TensorProto, ModelProto, AttributeProto, GraphProto
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/onnx/onnx_pb.py", line 4, in <module>
    from .onnx_ml_pb2 import *  # noqa
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 33, in <module>
    _descriptor.EnumValueDescriptor(
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

protobuf版本太高与现有的onnxparser不兼容，根据错误提示降低protobuf的版本即可。 python3 -m pip install protobuf==3.19.4

06、AttributeError: 'Upsample' object has no attribute 'recompute_scale_factor'

Traceback (most recent call last):
  File "export_onnx_models.py", line 148, in <module>
    convert_onnx(args.model_name, args.model_path, batch_size=args.batch_size, image_size=args.img_size, export_fp16=args.fp16, simplify=args.simplify, verify=args.verify, verbose=args.verbose)
  File "export_onnx_models.py", line 75, in convert_onnx
    test_infer_performance(model=model, model_name=model_name, batch_size=batch_size, input_shape=(3, image_size, image_size), num_data=10240)
  File "/home/xxx/Repo/infra_utilities/model_utils.py", line 72, in test_infer_performance
    ret = model(data)
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/Repo/infra_utilities/./models/yolox/models/yolox.py", line 30, in forward
    fpn_outs = self.backbone(x)
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/Repo/infra_utilities/./models/yolox/models/yolo_pafpn.py", line 98, in forward
    f_out0 = self.upsample(fpn_out0)  # 512/16
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/x x x/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/nn/modules/upsampling.py", line 154, in forward
    recompute_scale_factor=self.recompute_scale_factor)
  File "/home/xxx/software/miniconda3/envs/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Upsample' object has no attribute 'recompute_scale_factor'

torch版本降低到版本1.9.1，torchvision版本降低到版本0.10.1。但是我是通过在torch代码里进行更改进行解决。

https://github.com/pytorch/pytorch/pull/43535/files

07、RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

转化数据格式类型即可

08、ERROR - In node 23 (parseGraph): INVALID_NODE: Invalid Node - Pad_23

[shuffleNode.cpp::symbolicExecute::392] Error Code 4: Internal Error (Reshape_12: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])

因为tensorrt 截止到版本8.4.1为止，对动态输入的情况下，这种输入不支持导致的。遇到这种问题，当前只能先取消对动态输入的支持，采用固定shape的输入。

python3 -m pip install torch==1.12.0+cu115 -f https://download.pytorch.org/whl/torch_stable.html

参考链接

https://onnxruntime.ai/docs/execution-providers/ "onnxruntime execution"↩︎
https://onnxruntime.ai/docs/performance/↩︎