I'm having trouble getting multi-GPU training to work via DataParallel
across two Tesla K80 GPUs. The code I'm using is a modification of the MNIST example:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
from torch.nn import DataParallel
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=256, shuffle=True, num_workers=2, pin_memory=True)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x)
model = DataParallel(Net())
model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss().cuda()

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    input_var = Variable(data.cuda())
    target_var = Variable(target.cuda())
    print('Getting model output')
    output = model(input_var)   # never returns when both GPUs are visible
    print('Got model output')
    loss = criterion(output, target_var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print('Finished')
This doesn't throw an error, but it hangs after printing "Getting model output" and never returns. I traced this down to parallel_apply: it spawns one worker thread for each of GPU 0 and GPU 1, and those threads never finish.
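To narrow down which stage stalls, the same steps that DataParallel performs internally can be run by hand. A minimal sketch, assuming the standard torch.nn.parallel helpers (replicate, scatter, parallel_apply, gather) and the Net class defined above, with a dummy MNIST-sized batch:

import torch
from torch.autograd import Variable
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1]
net = Net().cuda(0)                                 # Net as defined above
x = Variable(torch.randn(256, 1, 28, 28).cuda(0))   # dummy MNIST-sized batch

scattered = scatter((x,), device_ids)          # split the batch; one arg tuple per GPU
replicas = replicate(net, device_ids)          # copy the model onto each GPU
outputs = parallel_apply(replicas, scattered)  # one worker thread per GPU -- the step that hangs for me
output = gather(outputs, 0)                    # collect the results back onto GPU 0
print(output.size())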
This is only a problem when CUDA_VISIBLE_DEVICES=0,1; both GPU 0 and GPU 1 work perfectly well individually.
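Since each GPU works on its own, a bare cross-device copy is a quick way to tell whether the hang is specific to DataParallel or affects any GPU 0 <-> GPU 1 transfer. This is a hypothetical check, not part of the script above:

import torch

# Allocate on GPU 0, copy to GPU 1 and back; if inter-GPU transfers are broken
# (for example a peer-to-peer access problem), this can hang in the same way.
a = torch.randn(1024, 1024).cuda(0)
b = a.cuda(1)
c = b.cuda(0)
torch.cuda.synchronize()
print('cross-device copy OK, max diff:', (a - c).abs().max())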
Before running this, nvidia-smi shows:
+------------------------------------------------------+
| NVIDIA-SMI 352.68 Driver Version: 352.68 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 40C P0 57W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:07:00.0 Off | 0 |
| N/A 35C P0 76W / 149W | 55MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
After running (while it hangs), nvidia-smi gives:
+------------------------------------------------------+
| NVIDIA-SMI 352.68 Driver Version: 352.68 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 42C P0 69W / 149W | 251MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:07:00.0 Off | 0 |
| N/A 36C P0 90W / 149W | 249MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4785 C python 194MiB |
| 1 4785 C python 192MiB |
+-----------------------------------------------------------------------------+
and top shows the main python process and the two python subprocesses. I'm wondering whether this could be something similar to #554.
Using this TensorFlow example, I get linear speedup across multiple GPUs as I change CUDA_VISIBLE_DEVICES, so multiple K80s should certainly be viable.
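It may also be worth checking whether the driver reports peer access between the two halves of the K80. A sketch, assuming torch.cuda.can_device_access_peer is available in this build (the CUDA samples' p2pBandwidthLatencyTest exercises the same path):

import torch

# Print whether each visible GPU reports direct (peer-to-peer) access to the others.
for i in range(torch.cuda.device_count()):
    for j in range(torch.cuda.device_count()):
        if i != j:
            print('GPU %d -> GPU %d peer access: %s'
                  % (i, j, torch.cuda.can_device_access_peer(i, j)))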