Setting up a GPU environment on a Virtual Compute instance to run deep learning services

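Deep learning comes down to having an environment that can use the GPU. I consider installing the GPU driver the most important step, and also the one most likely to go wrong in the whole process, so I have written down the complete procedure in detail.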

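Overview:

Create an Ubuntu Virtual Compute instance with a GPU attached.
Install the GPU driver.
Install Docker and confirm that containers can use the GPU.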

Install the GPU driver first:

First disable the nouveau driver (see https://hackmd.io/@Chieh/B1OP54uZq for the steps), then restart the instance from the TWCC page.
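
After the restart it can be worth confirming that nouveau is really gone before running the NVIDIA installer. The following is a minimal sketch, not part of the linked guide: the function name `nouveau_loaded` is hypothetical, and it assumes the module list is available at /proc/modules (the same data `lsmod` reads).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the nouveau module appears in a module list.
# The list defaults to /proc/modules; the optional argument only exists so the
# check can be tried against a sample file.
nouveau_loaded() {
    local modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q '^nouveau ' "$modules_file" 2>/dev/null
}

if nouveau_loaded; then
    echo "nouveau is still loaded; disable it and restart before installing"
else
    echo "nouveau is not loaded"
fi
```

If the first message appears, the blacklist step did not take effect and the NVIDIA installer is likely to fail.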

Install the dependencies:

sudo apt-get update
sudo apt-get install build-essential gcc-multilib dkms
sudo apt-get install linux-source

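Get the release of the running (generic) kernel:

uname -r

In my case the version was 5.4.0-94, so the kernel headers for that exact release are needed. Substituting the output of uname -r into the package name avoids a version mismatch:

sudo apt-get install linux-headers-$(uname -r)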
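
Because the headers package has to match the running kernel exactly, it can help to derive the package name from `uname -r` and check whether the headers are already in place. This is only a sketch: `headers_package` and `headers_present` are hypothetical helper names, and the check assumes the usual Ubuntu layout where installed headers are reachable through /lib/modules/<release>/build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the headers package name for a kernel release.
# With no argument it uses the running kernel reported by uname -r.
headers_package() {
    local release="${1:-$(uname -r)}"
    printf 'linux-headers-%s\n' "$release"
}

# Hypothetical helper: succeed if headers for the release appear to be installed.
# Assumes /lib/modules/<release>/build exists once the headers package is present.
headers_present() {
    local release="${1:-$(uname -r)}"
    [ -e "/lib/modules/${release}/build" ]
}

echo "Package to install: $(headers_package)"
if headers_present; then
    echo "Kernel headers for $(uname -r) appear to be installed"
else
    echo "Kernel headers for $(uname -r) are missing"
fi
```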

Download the driver from the official NVIDIA website. The GPU used here is a T4.

Run the driver installer:

sudo sh NVIDIA-Linux-x86_64-470.103.01.run

The installer asks a few questions along the way:

About DKMS: choose No

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later?

About the 32-bit libraries: choose No

Nvidia's 32-bit compatibility libraries?

Once the installer finishes, load the NVIDIA kernel module:

sudo modprobe nvidia

Check with nvidia-smi:

ubuntu@vm1652402369198:~$ nvidia-smi
Fri May 13 09:20:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   41C    P0    21W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
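
For scripts, the table above is awkward to parse; `nvidia-smi` also offers a CSV query mode. The sketch below is one way to use it, assuming the CSV output separates fields with a comma and a space. The function name `summarize_gpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "name, driver, memory" CSV lines into a short summary.
summarize_gpu() {
    local name driver memory
    while IFS=',' read -r name driver memory; do
        # Drop the single space that follows each comma in the CSV output.
        driver="${driver# }"
        memory="${memory# }"
        echo "GPU: ${name} | driver: ${driver} | memory: ${memory}"
    done
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total \
               --format=csv,noheader | summarize_gpu
else
    echo "nvidia-smi not found; the driver is probably not installed" >&2
fi
```

An empty result or the "not found" message means the driver installation needs another look before moving on to Docker.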

Install Docker

Installing Docker follows the usual procedure from the official site.

sudo apt install -y apt-transport-https curl gnupg-agent software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER
sudo chmod 777 /var/run/docker.sock
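
The group change made by usermod only applies to new login sessions, so a quick check of the current session's groups can explain a "permission denied" from docker. A minimal sketch, with `has_group` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a group name appears in a space-separated list.
has_group() {
    local wanted="$1" groups_list="$2" g
    for g in $groups_list; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# id -nG prints the group names of the current session.
if has_group docker "$(id -nG)"; then
    echo "This session is already in the docker group"
else
    echo "Not in the docker group yet; log out and back in, or run: newgrp docker"
fi
```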

Install the NVIDIA container toolkit (required to use the GPU inside containers)

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
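
The `distribution` variable in the commands above is built from /etc/os-release and decides which repository list gets downloaded. To see the value before touching the apt sources, something like the following sketch can be used; `repo_list_url` is a hypothetical helper, and the optional path argument exists only so it can be tried on a sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nvidia-docker list URL for this distribution.
repo_list_url() {
    local os_release="${1:-/etc/os-release}"
    local distribution
    # Source the file in a subshell so its variables do not leak into this shell.
    distribution=$(. "$os_release" && echo "${ID}${VERSION_ID}")
    echo "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list"
}

if [ -r /etc/os-release ]; then
    repo_list_url
else
    echo "/etc/os-release is not readable on this system" >&2
fi
```

On Ubuntu 20.04, for example, ID is ubuntu and VERSION_ID is 20.04, so the path component becomes ubuntu20.04.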

If the installation fails partway through (for example with broken dependencies), try this command first:

sudo apt --fix-broken install

Installation complete. Test it:

$ docker run --gpus all nvidia/cuda:11.0-base nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   34C    P0    16W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The GPU is now accessible from inside the container.
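
For an automated check, the same test can be reduced to counting the GPUs a container sees. The sketch below assumes `nvidia-smi -L` prints one line per device beginning with "GPU <index>:"; `count_gpus` is a hypothetical helper name, and the image tag is simply the one used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the device lines in `nvidia-smi -L` output on stdin.
count_gpus() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so that status is ignored.
    grep -c '^GPU [0-9]' || true
}

# Usage (requires Docker and the NVIDIA container toolkit):
#   docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi -L | count_gpus
```

A result of 0 means the container started but could not see any GPU.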

From here you can pull any Docker image you like on TWCC and run whatever deep learning workloads need a GPU environment.

Reference

https://www.cnblogs.com/pprp/p/9430836.html
