# NVIDIA Docker Driver Version Mismatch - Remote Troubleshooting and Fix Guide

## Problem Description

Error message:

```
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch
```

This means the user-space NVIDIA driver libraries and the loaded kernel module are at different versions.
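To confirm the mismatch concretely, you can compare the two versions by hand. A minimal sketch, assuming a standard Linux NVIDIA install (the commented-out extraction commands and the `/proc` parsing are assumptions and may need adjusting for your driver packaging):

```shell
#!/usr/bin/env bash
# Sketch: compare the user-space driver version with the kernel module version.
# On a GPU machine the two values would come from, e.g.:
#   user_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   kern_ver=$(sed -n 's/.*Kernel Module[[:space:]]*\([0-9.]*\).*/\1/p' /proc/driver/nvidia/version)

versions_match() {
    # Both versions must be non-empty and byte-identical.
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example with the versions from the mismatch scenario below:
if versions_match "550.90.07" "550.54.15"; then
    echo "OK: driver versions match"
else
    echo "MISMATCH: reload the kernel module or reboot"
fi
```

With the example versions this takes the mismatch branch; on a healthy machine both values are identical.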
---

## 📋 Step 1: Remote Diagnosis

Run the diagnostic script on the target machine:

```bash
# 1. Copy the diagnostic script to the target machine
scp diagnose-nvidia-docker.sh user@remote-host:~/

# 2. SSH into the target machine
ssh user@remote-host

# 3. Run the diagnostic script
bash diagnose-nvidia-docker.sh

# 4. View the generated diagnostic report
cat nvidia-docker-diagnostic-*.txt

# 5. Copy the report back for local analysis (optional)
# Run on the local machine:
scp user@remote-host:~/nvidia-docker-diagnostic-*.txt ./
```

The diagnostic script checks:

- ✅ NVIDIA driver version (user space)
- ✅ NVIDIA kernel module version
- ✅ Docker status and configuration
- ✅ NVIDIA Container Toolkit status
- ✅ Processes currently using the GPU
- ✅ Errors in the system logs
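The checks above boil down to a handful of standard commands. A minimal sketch of what such a script might run (the real `diagnose-nvidia-docker.sh` may format its report differently; each check here degrades gracefully so the sketch still runs on a broken machine):

```shell
#!/usr/bin/env bash
# Sketch of the checks a diagnostic script can perform.

section() { echo; echo "== $1 =="; }

section "User-space driver version"
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
    || echo "nvidia-smi unavailable or failing"

section "Kernel module version"
cat /proc/driver/nvidia/version 2>/dev/null || echo "NVIDIA module not loaded"

section "Docker status"
systemctl is-active docker 2>/dev/null || echo "docker service not active"

section "NVIDIA Container Toolkit"
nvidia-container-cli --version 2>/dev/null || echo "nvidia-container-cli not installed"

section "Processes using the GPU"
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "none found (or fuser unavailable)"

section "Recent NVIDIA messages in the kernel log"
sudo dmesg 2>/dev/null | grep -i nvidia | tail -n 5 || true
```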
---

## 🔧 Step 2: Fix Based on the Diagnosis

### Scenario A: Driver Version Mismatch (Most Common)

**Symptoms:**
```
User-space driver version: 550.90.07
Kernel module version:     550.54.15
```

**Fixes (in order of preference):**

#### Fix 1: Restart the Docker Service ⚡ (simplest, works ~80% of the time)

```bash
# SSH into the target machine
ssh user@remote-host

# Stop all containers
sudo docker stop $(sudo docker ps -aq)

# Restart Docker
sudo systemctl restart docker

# Test
sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

**If it succeeds**: the problem is solved; skip to Step 3 and start the application.

**If it fails**: continue to the next fix.

---

#### Fix 2: Reload the NVIDIA Kernel Modules 💪 (works ~95% of the time)

```bash
# SSH into the target machine
ssh user@remote-host

# Use the fix script (recommended)
sudo bash fix-nvidia-docker.sh

# Or run the steps manually:
# 1. Stop Docker and every process using the GPU
sudo systemctl stop docker
sudo killall -9 python python3 nvidia-smi 2>/dev/null || true

# 2. Unload the NVIDIA kernel modules
sudo rmmod nvidia_uvm 2>/dev/null || true
sudo rmmod nvidia_drm 2>/dev/null || true
sudo rmmod nvidia_modeset 2>/dev/null || true
sudo rmmod nvidia 2>/dev/null || true

# 3. Reload the modules
sudo modprobe nvidia
sudo modprobe nvidia_uvm
sudo modprobe nvidia_drm
sudo modprobe nvidia_modeset

# 4. Restart Docker
sudo systemctl restart docker

# 5. Test
sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

**If it succeeds**: the problem is solved.

**If it fails**: the kernel module is probably still held by some process; continue to the next fix.
---

#### Fix 3: Reboot the System 🔄 (works ~99% of the time)

```bash
# SSH into the target machine
ssh user@remote-host

# Reboot
sudo reboot

# Wait for the system to come back up (about 1-2 minutes)
sleep 120

# Reconnect and test
ssh user@remote-host
sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

**Note**: a reboot interrupts every service on the machine; make sure a brief outage is acceptable.

---

### Scenario B: NVIDIA Container Toolkit Problems

**Symptoms:**
```
❌ nvidia-container-cli is not installed
or
nvidia-container-cli is outdated
```

**Fix:**

```bash
# SSH into the target machine
ssh user@remote-host

# Update the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

# Add the repository (if not already added)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install/update
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

# Test
sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

---

### Scenario C: Docker Configuration Problems

**Symptoms:**
```
/etc/docker/daemon.json does not exist
or the nvidia runtime entry is missing
```

**Fix:**

```bash
# SSH into the target machine
ssh user@remote-host

# Create/update the Docker configuration
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF

# Restart Docker
sudo systemctl restart docker

# Test
sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

**Note**: the `tee` above overwrites any existing `/etc/docker/daemon.json`; if the file already contains other settings, merge the `runtimes` entry by hand instead.
---

## 🚀 Step 3: Start the Application

Once the fix succeeds, start the doc_processer container:

```bash
# SSH into the target machine
ssh user@remote-host

# Make sure any old container is removed
sudo docker rm -f doc_processer 2>/dev/null || true

# Start the container
sudo docker run -d --gpus all --network host \
    --name doc_processer \
    --restart unless-stopped \
    -v /home/yoge/.paddlex:/root/.paddlex:ro \
    -v /home/yoge/.cache/modelscope:/root/.cache/modelscope:ro \
    -v /home/yoge/.cache/huggingface:/root/.cache/huggingface:ro \
    doc_processer:latest

# Check the container status
sudo docker ps | grep doc_processer

# Follow the logs
sudo docker logs -f doc_processer
```
---

## 📊 Verification and Monitoring

### Verify GPU Access

```bash
# Check the GPU inside the container
sudo docker exec doc_processer nvidia-smi

# Test the API
curl http://localhost:8053/health
```

### Monitor Logs

```bash
# Live logs
sudo docker logs -f doc_processer

# Last 100 lines
sudo docker logs --tail 100 doc_processer
```
---

## 🛠️ Useful Remote Commands

### One-Shot Diagnose-and-Fix

```bash
# Create this script on the target machine
cat > quick-fix.sh <<'EOF'
#!/bin/bash
set -e

echo "🔧 Quick fix script"
echo "================"

# Option 1: restart Docker
echo "Trying a Docker restart..."
sudo docker stop $(sudo docker ps -aq) 2>/dev/null || true
sudo systemctl restart docker
sleep 3

if sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi &>/dev/null; then
    echo "✅ Fixed (Docker restart)"
    exit 0
fi

# Option 2: reload the modules
echo "Trying to reload the NVIDIA modules..."
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null || true
sudo modprobe nvidia nvidia_uvm nvidia_drm nvidia_modeset
sudo systemctl restart docker
sleep 3

if sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi &>/dev/null; then
    echo "✅ Fixed (module reload)"
    exit 0
fi

# Option 3: a reboot is required
echo "❌ Automatic fixes failed; the system needs a reboot"
echo "Run: sudo reboot"
exit 1
EOF

chmod +x quick-fix.sh
sudo bash quick-fix.sh
```

### SSH Tunnel (for Local Access to the Remote Service)

```bash
# Run on the local machine
ssh -L 8053:localhost:8053 user@remote-host

# The service is now reachable locally
curl http://localhost:8053/health
```
---

## 📝 Troubleshooting Checklist

- [ ] Run `diagnose-nvidia-docker.sh` to generate a full diagnostic report
- [ ] Check that the driver versions match (user space vs kernel module)
- [ ] Check that the NVIDIA Container Toolkit is installed
- [ ] Check the `/etc/docker/daemon.json` configuration
- [ ] Try restarting the Docker service
- [ ] Try reloading the NVIDIA kernel modules
- [ ] Check for processes holding the GPU
- [ ] Review the Docker logs: `journalctl -u docker -n 100`
- [ ] Last resort: reboot the system

---

## 💡 Prevention

### 1. Pin the NVIDIA Driver Version

```bash
# Hold the current driver version
sudo apt-mark hold nvidia-driver-*

# List the held packages
apt-mark showhold
```
### 2. Restart Docker Automatically (After a Driver Update)

```bash
# Create a systemd service
sudo tee /etc/systemd/system/nvidia-docker-restart.service <<EOF
[Unit]
Description=Restart Docker after NVIDIA driver update
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart docker

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable nvidia-docker-restart.service
```

### 3. Monitoring Script

```bash
# Create the monitoring script
cat > /usr/local/bin/check-nvidia-docker.sh <<'EOF'
#!/bin/bash
if ! docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi &>/dev/null; then
    echo "$(date): NVIDIA Docker access failed" >> /var/log/nvidia-docker-check.log
    systemctl restart docker
fi
EOF

chmod +x /usr/local/bin/check-nvidia-docker.sh

# Add to crontab (check every 5 minutes)
echo "*/5 * * * * /usr/local/bin/check-nvidia-docker.sh" | sudo crontab -
```

**Note**: `sudo crontab -` replaces root's entire crontab; if other entries already exist, edit with `crontab -e` instead.
---

## 📞 Need Help?

If none of the fixes above works, please provide:

1. **Diagnostic report**: the full contents of `nvidia-docker-diagnostic-*.txt`
2. **Error logs**: `sudo docker logs doc_processer`
3. **System information**:
   ```bash
   nvidia-smi
   docker --version
   nvidia-container-cli --version
   uname -a
   ```

---

## Quick Reference

| Command | Purpose |
|------|------|
| `bash diagnose-nvidia-docker.sh` | Generate a diagnostic report |
| `sudo bash fix-nvidia-docker.sh` | Automatic fix script |
| `sudo systemctl restart docker` | Restart Docker |
| `sudo reboot` | Reboot the system |
| `docker logs -f doc_processer` | Follow the application logs |
| `docker exec doc_processer nvidia-smi` | Check the GPU inside the container |
|