This post demonstrates how to use the PyTorch Profiler.

Official tutorial

The PyTorch profiler lives in torch.profiler (the older interface, torch.autograd.profiler, is still available). Currently supported features:

  • Execution-time statistics for CPU/GPU operators

  • Input tensor shape analysis for CPU/GPU operators

  • Memory consumption statistics for operators

Environment: Python 3.11.6, torch 2.1.2+cu121

Import

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

ResNet model

Using ResNet as an example, create a ResNet model instance and initialize an input:

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

Using the profiler on CPU

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profile(
    activities=[ProfilerActivity.CPU],
    record_shapes=False,
    profile_memory=False,
) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

The parameters passed to profile are explained below:

  • activities: a list specifying what the profiler monitors; ProfilerActivity.CPU means only CPU activity is recorded

  • record_shapes: bool, whether to record the input shapes of each operator

  • profile_memory: bool, whether to record the memory consumption of the model's tensors

record_function labels a user-defined code region to be profiled (here, the model's forward pass).

This produces the following output:

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.85%       3.699ms       100.00%      96.039ms      96.039ms             1
                     aten::conv2d         0.12%     119.000us        65.86%      63.255ms       3.163ms            20
                aten::convolution         0.41%     395.000us        65.74%      63.136ms       3.157ms            20
               aten::_convolution         0.34%     322.000us        65.33%      62.741ms       3.137ms            20
         aten::mkldnn_convolution        64.68%      62.118ms        64.99%      62.419ms       3.121ms            20
                 aten::batch_norm         0.08%      77.000us        15.54%      14.928ms     746.400us            20
     aten::_batch_norm_impl_index         0.17%     159.000us        15.46%      14.851ms     742.550us            20
          aten::native_batch_norm        15.12%      14.524ms        15.28%      14.673ms     733.650us            20
                 aten::max_pool2d         0.02%      20.000us         8.49%       8.157ms       8.157ms             1
    aten::max_pool2d_with_indices         8.47%       8.137ms         8.47%       8.137ms       8.137ms             1
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 96.039ms

The metrics are explained as follows:

  • Name: function name

  • Self CPU %: CPU time share of this function itself, excluding nested calls

  • Self CPU: CPU time of this function itself, excluding nested calls

  • CPU total %: CPU time share of this function including nested calls

  • CPU total: CPU time of this function including nested calls

  • CPU time avg: average time per call of this function

  • # of Calls: number of times this function was called
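Besides wrapping the whole forward pass, record_function labels can be nested to attribute time to your own code stages rather than only to individual aten:: operators. A minimal CPU-only sketch (the preprocess helper and the label names are illustrative, not from this post):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def preprocess(x):
    # illustrative stage: normalize the batch
    return (x - x.mean()) / (x.std() + 1e-6)

x = torch.randn(32, 128)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("preprocess"):   # user-defined label
        x = preprocess(x)
    with record_function("matmul"):       # another labeled region
        y = x @ x.T

# both labels appear as rows alongside the aten:: operators
names = {evt.key for evt in prof.key_averages()}
print("preprocess" in names, "matmul" in names)
```

Each labeled region then shows up as its own row in the key_averages() table, with Self/total times aggregated exactly like the built-in operators.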

Using the profiler on GPU

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=False,
    profile_memory=False,
) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Compared with the CPU version, the changes are:

  • The model and inputs must be placed in GPU memory beforehand, using the .cuda() method

  • The activities argument is changed to [ProfilerActivity.CPU, ProfilerActivity.CUDA]

  • The sort_by argument is changed to cuda_time_total

This produces the following output:

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
                  model_inference         0.86%       3.915ms       100.00%     452.950ms     452.950ms       2.276ms         0.50%     452.948ms      452.948ms             1
                     aten::conv2d         0.13%     597.000us        85.10%     385.469ms      19.273ms     175.000us         0.04%     384.578ms       19.229ms            20
                aten::convolution         0.17%     789.000us        84.97%     384.872ms      19.244ms     481.000us         0.11%     384.403ms       19.220ms            20
               aten::_convolution         2.19%       9.905ms        84.80%     384.083ms      19.204ms       9.736ms         2.15%     383.922ms       19.196ms            20
          aten::cudnn_convolution        82.61%     374.178ms        82.61%     374.178ms      18.709ms     374.186ms        82.61%     374.186ms       18.709ms            20
                 aten::batch_norm         0.05%     208.000us         4.78%      21.663ms       1.083ms     118.000us         0.03%      23.597ms        1.180ms            20
     aten::_batch_norm_impl_index         0.25%       1.124ms         4.74%      21.455ms       1.073ms     806.000us         0.18%      23.479ms        1.174ms            20
           aten::cudnn_batch_norm         4.14%      18.733ms         4.49%      20.331ms       1.017ms      18.819ms         4.15%      22.673ms        1.134ms            20
                     aten::linear         0.02%      70.000us         3.64%      16.474ms      16.474ms      58.000us         0.01%      16.481ms       16.481ms             1
                      aten::addmm         3.58%      16.215ms         3.60%      16.289ms      16.289ms      16.226ms         3.58%      16.304ms       16.304ms             1
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
Self CPU time total: 452.950ms
Self CUDA time total: 452.948ms

The metrics are analogous to the CPU case and are not repeated here.
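A note on record_shapes, which was left off in the examples above: when it is enabled, key_averages(group_by_input_shape=True) splits each operator's statistics per input shape, which helps spot which tensor sizes dominate. A minimal CPU-only sketch (the tiny Linear model is illustrative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

layer = torch.nn.Linear(16, 4)  # tiny illustrative model

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    layer(torch.randn(8, 16))
    layer(torch.randn(2, 16))   # a second input shape

# grouping by input shape adds an "Input Shapes" column, with one row
# per (operator, input shape) combination
table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=8
)
print("Input Shapes" in table)
```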

Export tracing file

The profiler can export its results as a .json trace file:

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=False,
    profile_memory=False,
) as prof:
    with record_function("model_inference"):
        model(inputs)

prof.export_chrome_trace("temp.json")

The file can be visualized at edge://tracing/ (or chrome://tracing/ in Chrome), which shows function execution and nesting relationships on a timeline.
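As a quick sanity check that the export worked, the file can be loaded back as ordinary JSON; a CPU-only sketch (the temporary path is illustrative):

```python
import json
import os
import tempfile

import torch
from torch.profiler import profile, ProfilerActivity

# profile a small CPU workload (illustrative)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(64, 64) @ torch.randn(64, 64)

# export to a temporary path (illustrative location)
path = os.path.join(tempfile.mkdtemp(), "temp.json")
prof.export_chrome_trace(path)

# the file is Chrome-trace JSON; depending on the build it is either
# a dict with a "traceEvents" list or a bare list of events
with open(path) as f:
    trace = json.load(f)
events = trace["traceEvents"] if isinstance(trace, dict) else trace
print(len(events) > 0)
```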
