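ACK implements GPU topology-aware scheduling on top of the Kubernetes Scheduling Framework: among the possible GPU combinations on a node, the scheduler selects the combination that delivers the best training speed. This topic describes how to use GPU topology-aware scheduling to speed up distributed TensorFlow training.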

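Prerequisites

An ACK Pro cluster is created, and the instance type of the cluster is set to GPU-accelerated cloud server. For more information, see Create a managed Kubernetes cluster.
Arena is installed.
The GPU topology-aware scheduling component is installed.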

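The system components meet the following version requirements.

Component	Version requirement
Kubernetes	1.18.8 or later
NVIDIA driver	418.87.01 or later
NCCL (used by the training framework)	2.7 or later
Operating system	CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux 2, Alibaba Cloud Linux 3
GPU	V100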

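Usage notes

Only distributed training with MPI jobs is supported.
The Pods of a job are created and the job starts only when the resource requests of all Pods in the job can be satisfied. Otherwise, the job waits for resources.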
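
The all-or-nothing rule in the second note can be illustrated with a deliberately simplified check. This is only a sketch of the idea, not how the ACK scheduler is implemented: it compares a single free-GPU total against the total request and ignores per-node placement and GPU topology. The helper name `gang_admit` and its arguments are hypothetical.

```shell
# gang_admit FREE_GPUS WORKERS GPUS_PER_WORKER
# Exit status 0 only if the requests of ALL workers fit at the same time.
# Simplification (assumption): one cluster-wide GPU count, no per-node fitting.
gang_admit() {
  needed=$(( $2 * $3 ))
  [ "$needed" -le "$1" ]
}

# Four workers with one GPU each against 8 free GPUs, then against 3 free GPUs.
if gang_admit 8 4 1; then echo "start job"; else echo "wait for resources"; fi
if gang_admit 3 4 1; then echo "start job"; else echo "wait for resources"; fi
```

In the second call no Pod is started at all, even though three of the four workers would fit on their own.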

Procedure

Configure nodes

Run the following command to set the node label and explicitly enable GPU topology-aware scheduling on the node.

kubectl label node <Your Node Name> ack.node.gpu.schedule=topology

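Note

After GPU topology-aware scheduling is enabled on a node, regular GPU scheduling is no longer supported on that node. To restore regular GPU scheduling, run the following command to change the label.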

kubectl label node <Your Node Name> ack.node.gpu.schedule=default --overwrite
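
When several nodes have to be switched back and forth, the two label commands above can be wrapped in a small helper. The following is a minimal sketch: the function name `set_gpu_schedule_mode` and the `KUBECTL` override variable are hypothetical, and `--overwrite` is passed in both directions, which is harmless when the label is not yet set.

```shell
# kubectl binary to use; set KUBECTL=echo for a dry run that only prints the arguments.
KUBECTL="${KUBECTL:-kubectl}"

# set_gpu_schedule_mode NODE MODE
# MODE is "topology" (GPU topology-aware scheduling) or "default" (regular GPU scheduling).
set_gpu_schedule_mode() {
  case "$2" in
    topology|default) ;;
    *) echo "mode must be 'topology' or 'default'" >&2; return 1 ;;
  esac
  "$KUBECTL" label node "$1" "ack.node.gpu.schedule=$2" --overwrite
}

# Usage (node name is a placeholder):
#   set_gpu_schedule_mode <Your Node Name> topology
#   set_gpu_schedule_mode <Your Node Name> default
```

Rejecting any other mode value up front avoids leaving a node with a label value that neither command in this topic uses.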

Submit a job

Submit an MPI job with --gputopology set to true.

arena submit --gputopology=true --gang ***

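Example 1: Train VGG16

Note

The test cluster in this example has two nodes, each with eight V100 GPUs.

Train VGG16 with GPU topology-aware scheduling

Run the following command to submit a job to the cluster.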

arena submit mpi \
  --name=tensorflow-topo-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Run the following command to check the status of the job.

arena get tensorflow-topo-4-vgg16 --type mpijob

Expected output:

Name:      tensorflow-topo-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  2m

Instances:
  NAME                                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                    ------   ---  --------  --------------  ----
  tensorflow-topo-4-vgg16-launcher-lmhjl  Running  2m   true      0               cn-shanghai.192.168.16.172
  tensorflow-topo-4-vgg16-worker-0        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-1        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-2        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-3        Running  2m   false     1               cn-shanghai.192.168.16.173

Run the following command to view the job logs.
```
arena logs -f tensorflow-topo-4-vgg16
```
Expected output:
```
total images/sec: 991.92
```
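
The `total images/sec` line is the figure that matters when comparing runs. A small filter can pull it out of a log stream; this is a sketch that assumes the log line format shown above, and `extract_throughput` is a hypothetical helper name.

```shell
# Read benchmark logs on stdin and print the last reported "total images/sec" value.
# Prints an empty line if the benchmark has not reported a total yet.
extract_throughput() {
  awk '/total images\/sec:/ { value = $NF } END { print value }'
}

# Usage (assumes the job has finished writing its summary line):
#   arena logs tensorflow-topo-4-vgg16 | extract_throughput
```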

Train VGG16 with regular GPU scheduling

Run the following command to submit a job to the cluster.

arena submit mpi \
  --name=tensorflow-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Run the following command to check the status of the job.

arena get tensorflow-4-vgg16 --type mpijob

Expected output:

Name:      tensorflow-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  9s

Instances:
  NAME                               STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                               ------   ---  --------  --------------  ----
  tensorflow-4-vgg16-launcher-xc28k  Running  9s   true      0               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
  tensorflow-4-vgg16-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173
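
The NODE column is where the two scheduling modes differ: with GPU topology-aware scheduling all four workers ran on one node, while with regular scheduling they were spread over two. The check can be scripted; the sketch below assumes the instance table layout shown in the expected output (Pod name in the first column, node in the last), and `count_worker_nodes` is a hypothetical helper name.

```shell
# Read `arena get` output on stdin and print how many distinct nodes host worker Pods.
# Launcher Pods and header lines are ignored because their first column
# does not end in "-worker-<index>".
count_worker_nodes() {
  awk '$1 ~ /-worker-[0-9]+$/ { nodes[$NF] = 1 }
       END { n = 0; for (k in nodes) n++; print n }'
}

# Usage:
#   arena get tensorflow-4-vgg16 --type mpijob | count_worker_nodes
```

A result of 1 means that all workers share a node and therefore do not have to communicate across nodes.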

Run the following command to view the job logs.
```
arena logs -f tensorflow-4-vgg16
```
Expected output:
```
total images/sec: 200.47
```

Example 2: Train ResNet50

Train ResNet50 with GPU topology-aware scheduling

Run the following command to submit a job to the cluster.

arena submit mpi \
  --name=tensorflow-topo-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64  --variable_update=horovod"

Run the following command to check the status of the job.

arena get tensorflow-topo-4-resnet50 --type mpijob

Expected output:

Name:      tensorflow-topo-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  8s

Instances:
  NAME                                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                       ------   ---  --------  --------------  ----
  tensorflow-topo-4-resnet50-launcher-7ln8j  Running  8s   true      0               cn-shanghai.192.168.16.172
  tensorflow-topo-4-resnet50-worker-0        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-1        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-2        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-3        Running  8s   false     1               cn-shanghai.192.168.16.173

Run the following command to view the job logs.

arena logs -f tensorflow-topo-4-resnet50

Expected output:

total images/sec: 1471.55

Train ResNet50 with regular GPU scheduling

Run the following command to submit a job to the cluster.

arena submit mpi \
  --name=tensorflow-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64  --variable_update=horovod"

Run the following command to check the status of the job.

arena get tensorflow-4-resnet50 --type mpijob

Expected output:

Name:      tensorflow-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  9s

Instances:
  NAME                                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                  ------   ---  --------  --------------  ----
  tensorflow-4-resnet50-launcher-q24hv  Running  9s   true      0               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
  tensorflow-4-resnet50-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173

Run the following command to view the job logs.
```
arena logs -f tensorflow-4-resnet50
```
Expected output:
```
total images/sec: 745.38
```

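Performance comparison

The figure below compares the performance of the four test cases above.

[Figure GPU31: performance comparison of the four test cases]

The comparison shows that GPU topology-aware scheduling substantially improved distributed TensorFlow training throughput in these tests: VGG16 rose from 200.47 to 991.92 images/sec (about 4.9x), and ResNet50 rose from 745.38 to 1471.55 images/sec (about 2.0x).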
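
To run the same comparison on your own model, the ratio of the two `total images/sec` values is enough. The following is a minimal sketch; `speedup` is a hypothetical helper, and the inputs are the throughput figures from the example logs in this topic.

```shell
# speedup BASELINE OPTIMIZED
# Print OPTIMIZED / BASELINE rounded to two decimals.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN {
    if (base <= 0) { print "baseline must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", opt / base
  }'
}

speedup 200.47 991.92    # VGG16: regular scheduling vs. topology-aware scheduling, prints 4.95
speedup 745.38 1471.55   # ResNet50: regular scheduling vs. topology-aware scheduling, prints 1.97
```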

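Important

The performance data in this topic are theoretical values for reference only. The improvement from GPU topology-aware scheduling depends on your model and your cluster environment, so actual results are determined by your own environment. You can follow the examples above to benchmark your own models.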
