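# Distributed Parallel Training (GPU)

`GPU` `Advanced` `Distributed Parallel`

<!-- TOC -->

- [Distributed Parallel Training (GPU)](#distributed-parallel-training-gpu)
    - [Preparation](#preparation)
        - [Configuring the Distributed Environment](#configuring-the-distributed-environment)
    - [Running the Script](#running-the-script)
    - [Running the Multi-Node Script](#running-the-multi-node-script-todo)

<!-- /TOC -->

This tutorial explains how to train the ResNet-50 network on the GPU hardware platform using LuoJiaNET's data parallel and auto parallel modes.

## Preparation

### Configuring the Distributed Environment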

- `OpenMPI-4.0.3`: the multi-process communication library used by LuoJiaNET.

- `NCCL-2.7.6`: the NVIDIA Collective Communications Library.
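
Before launching a job, it can help to confirm that the OpenMPI found on the `PATH` matches the version listed above. The following is a minimal sketch and not part of LuoJiaNET: `parse_version` and `check_openmpi` are hypothetical helper names, and the sketch assumes that `mpirun --version` prints a line such as `mpirun (Open MPI) 4.0.3`. The NCCL version is not checked here.

```python
import re
import shutil
import subprocess

EXPECTED_OPENMPI = (4, 0, 3)  # version listed in this tutorial


def parse_version(text):
    """Return the first dotted version in `text` as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def check_openmpi(expected=EXPECTED_OPENMPI):
    """Return True if `mpirun` is on PATH and reports the expected version."""
    if shutil.which("mpirun") is None:
        return False
    result = subprocess.run(
        ["mpirun", "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the banner to stderr, so search both streams.
    return parse_version(result.stdout + result.stderr) == expected
```

`check_openmpi()` returns `False` both when `mpirun` is missing and when the version differs, so a failed check only signals that the environment deserves a closer look.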


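### Calling the Collective Communication Library

On the GPU hardware platform, LuoJiaNET uses NCCL for communication in distributed parallel training.
Adding one line of code,
`context.set_auto_parallel_context(parallel_mode=context.ParallelMode.AUTO_PARALLEL)`,
selects auto parallel mode, in which the framework chooses the parallel strategy so as to make the best use of the available devices.

The following code sample enables auto parallel mode: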

```python
from luojianet_ms import context
from luojianet_ms.communication import init

if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    context.set_auto_parallel_context(parallel_mode=context.ParallelMode.AUTO_PARALLEL)
    init("nccl")
    ...
```

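Where:

- `mode=context.GRAPH_MODE`: distributed training requires graph mode as the execution mode.
- `init("nccl")`: enables NCCL communication and completes the initialization for distributed training.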

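## Running the Script

The following script uses `mpirun` to launch 8 processes that run the training script `resnet50_distributed_training.py`:

```bash
#!/bin/bash
# Create the log directory; -p avoids an error if it already exists.
mkdir -p device
# Launch 8 processes in the background and redirect stdout and stderr to device/train.log.
mpirun -n 8 pytest -s -v ./resnet50_distributed_training.py > device/train.log 2>&1 &
```

The script runs in the background and writes its log to `train.log` in the `device` directory. The run covers 10 epochs of 234 steps each, and the loss values are recorded in `train.log`. A sample excerpt follows; each of the 8 processes prints its own line: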

```text
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
```
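
The first-step loss in this excerpt is close to ln(10) ≈ 2.302585, which is the cross-entropy of a uniform prediction over 10 classes. This reading assumes a 10-class softmax cross-entropy loss, which the tutorial does not state; under that assumption, the value is what an untrained network that assigns equal probability to every class would produce. The sketch below computes it; `uniform_cross_entropy` is a hypothetical helper name.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy when every class is assigned probability 1 / num_classes."""
    probability = 1.0 / num_classes
    # Cross-entropy for one sample is -log(p of the true class).
    return -math.log(probability)


loss = uniform_cross_entropy(10)
print(f"{loss:.6f}")  # close to the 2.3025854 reported in the log
```

The small difference in the last digits is consistent with the training run using single-precision arithmetic.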

For saving and loading model parameters during distributed training on GPU, see [Saving and Loading Model Parameters in Distributed Training](https://www.luojianet.cn/tutorials/zh-CN/master/intermediate/distributed_training/distributed_training_model_parameters_saving_and_loading.html).