This is a complete troubleshooting and resolution record for a GPU/torch detection problem that appeared in a Windows development environment when running the following code. It covers the steps I executed, the key commands and their output, the root-cause analysis, and follow-up recommendations.

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.jsonl")

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2-7B-Instruct",
    max_seq_length = 2048,
    dtype = torch.float16,
    load_in_4bit = True,
)
1. Problem overview

When running python main.py, the program threw an error while importing unsloth:

NotImplementedError: Unsloth cannot find any torch accelerator? You need a GPU.

However, nvidia-smi showed that the machine does have an NVIDIA GPU (a Tesla P40), so the question became why PyTorch could not see any CUDA device. (A minimal approximation of the failing check follows the nvidia-smi output below.)
PS C:\Windows\system32> nvidia-smi
Wed Nov 5 11:34:30 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 TCC | 00000000:04:00.0 Off | Off |
| N/A 36C P8 11W / 250W | 8MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
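For context, the check that raises this error lives in unsloth_zoo.device_type.get_device_type (see the traceback in section 3 below). Roughly, it asks torch whether any accelerator backend is usable. A minimal approximation (my sketch, not unsloth's exact code):

import torch

def get_device_type_approx() -> str:
    # Approximation of unsloth_zoo.device_type.get_device_type: probe the
    # accelerator backends torch knows about and fail if none is usable.
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    raise NotImplementedError("Unsloth cannot find any torch accelerator? You need a GPU.")

So the error means torch itself reports no accelerator, regardless of what the driver sees.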
2. Environment (key facts)

- Working directory: c:\Users\admin\workspace\qwen-fintune
- Virtual environment: .venv (activated for testing and installs)
- GPU (nvidia-smi): Tesla P40
- Driver (from the nvidia-smi summary): Driver Version: 551.61, CUDA Version: 12.4
3. Troubleshooting steps and execution log

The commands and key output are listed below in chronological order (excerpted, with brief notes):

- Running main.py fails:
(.venv) PS C:\Users\admin\workspace\qwen-fintune> python .\main.py
Generating train split: 2 examples [00:00, 9.21 examples/s]
Traceback (most recent call last):
  File "C:\Users\admin\workspace\qwen-fintune\main.py", line 5, in <module>
    from unsloth import FastLanguageModel
  File "C:\Users\admin\workspace\qwen-fintune\.venv\Lib\site-packages\unsloth\__init__.py", line 77, in <module>
    import unsloth_zoo
  File "C:\Users\admin\workspace\qwen-fintune\.venv\Lib\site-packages\unsloth_zoo\__init__.py", line 140, in <module>
    from .device_type import (
  File "C:\Users\admin\workspace\qwen-fintune\.venv\Lib\site-packages\unsloth_zoo\device_type.py", line 56, in <module>
    DEVICE_TYPE : str = get_device_type()
                        ^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\workspace\qwen-fintune\.venv\Lib\site-packages\unsloth_zoo\device_type.py", line 46, in get_device_type
    raise NotImplementedError("Unsloth cannot find any torch accelerator? You need a GPU.")
NotImplementedError: Unsloth cannot find any torch accelerator? You need a GPU.
- Checking the system GPU (nvidia-smi):
PS> nvidia-smi
Wed Nov 5 10:43:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
| 0 Tesla P40 TCC | 00000000:04:00.0 Off | Off |
| N/A 59C P0 55W / 250W | 20130MiB / 24576MiB | 0% Default |
+-----------------------------------------------------------------------------------------+
At this point a process (ollama.exe) was holding a large amount of VRAM (~20 GB); we later terminated it to avoid running out of memory.
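Related: rather than reading nvidia-smi's process table by eye, pynvml can enumerate the VRAM holders directly. A small sketch (compute contexts only; graphics contexts need nvmlDeviceGetGraphicsRunningProcesses):

import pynvml

# Enumerate compute processes holding memory on GPU 0.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory can be None when the driver does not expose per-process usage.
    print("pid:", proc.pid, "VRAM bytes:", proc.usedGpuMemory)
pynvml.nvmlShutdown()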
- Checking PyTorch and CUDA availability inside the virtual environment (first pass):

Command:

.venv\Scripts\python.exe -c "import sys, torch; print('PY:', sys.executable); print('TORCH:', getattr(torch,'__version__','<no torch>')); print('CUDA available:', torch.cuda.is_available() if hasattr(torch,'cuda') else 'no cuda module'); print('torch.version.cuda:', getattr(torch.version,'cuda',None)); print('device count:', torch.cuda.device_count() if hasattr(torch,'cuda') else 0); print('names:', [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())] if torch.cuda.device_count() else [])"
Output summary (at the time):

PY: C:\Users\admin\workspace\qwen-fintune\.venv\Scripts\python.exe
TORCH: 2.8.0+cpu
CUDA available: False
torch.version.cuda: None
device count: 0
names: []
Conclusion: the PyTorch installed in the virtual environment was CPU-only (the version carries a +cpu suffix), which is why torch.cuda.is_available() returned False.
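A quick way to tell a CPU-only build apart from a CUDA build, without relying on the version suffix alone (my sketch):

import torch

# torch.version.cuda is the CUDA runtime the wheel was compiled against;
# it is None on a CPU-only build regardless of what the driver supports.
print("wheel version:", torch.__version__)          # e.g. 2.8.0+cpu vs 2.8.0+cu121
print("compiled CUDA runtime:", torch.version.cuda) # None => CPU-only wheel
print("runtime sees a GPU:", torch.cuda.is_available())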
- Attempting to install a CUDA-enabled PyTorch (target: cu121)

Commands (run inside the venv):

.venv\Scripts\python.exe -m pip install --upgrade pip
.venv\Scripts\python.exe -m pip install --force-reinstall torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

Summary of the install output and the important findings:

- pip ended up installing torch-2.9.0 and related packages, but reported a dependency conflict during resolution (xformers 0.0.32.post2 requires torch==2.8.0).
- More importantly, after verifying the install, torch.__version__ was 2.9.0+cpu and torch.cuda.is_available() was still False.

Explanation: even though PyTorch's cu121 index was supplied, pip still installed a CPU-only wheel. The likely mechanism: --extra-index-url makes pip resolve across both PyPI and the extra index and pick the highest matching version overall, and the newest torch on PyPI (CPU-only on Windows) can outrank the newest build on the cu121 index, so the CPU wheel wins. Restricting resolution with --index-url, or pinning an exact +cu121 version, avoids this (see the solutions in section 6).
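A short post-install check (a sketch; the helper name is mine) that makes the wheel variant explicit, so a silent fall-back to a CPU wheel is caught immediately:

import torch

def assert_cuda_wheel() -> None:
    # Fail loudly if the installed wheel has no CUDA runtime compiled in,
    # or if the runtime cannot reach the driver.
    if torch.version.cuda is None:
        raise RuntimeError(f"CPU-only wheel installed: {torch.__version__}")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA wheel installed, but no usable CUDA device found")

assert_cuda_wheel()
print("OK:", torch.__version__, "on", torch.cuda.get_device_name(0))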
- For a finer-grained view, I created and ran check-cuda.py (added to the repository). The script collects device information through three independent paths: torch, pynvml, and nvidia-smi. It lives at check-cuda.py in the project root.

JSON output of the script run (formatted; key information):
{
"python_executable": "C:\Users\admin\workspace\qwen-fintune\.venv\Scripts\python.exe",
"torch": {
"torch_version": "2.9.0+cpu",
"cuda_available": false,
"torch_cuda_version": null,
"device_count": 0,
"devices": []
},
"pynvml": {
"gpus": [
{
"id": 0,
"name": "Tesla P40",
"memory_total_bytes": 25769803776,
"memory_free_bytes": 25654001664,
"memory_used_bytes": 115802112,
"gpu_util_percent": 0,
"memory_util_percent": 0
}
]
},
"nvidia_smi": {
"gpus": [
{
"index": 0,
"name": "Tesla P40",
"memory_total_MiB": 24576,
"memory_free_MiB": 24465,
"memory_used_MiB": 8,
"util_gpu_percent": 0,
"util_mem_percent": 0
}
]
}
}
Note: during the script run a warning appeared (from torch's import) suggesting nvidia-ml-py as a replacement for pynvml. This is only an API deprecation notice and does not affect device visibility.
4. Root cause

Summary:

- The system driver, nvidia-smi, and pynvml all recognize the Tesla P40, so the driver layer is healthy.
- However, the PyTorch installed in the virtual environment is CPU-only (the wheel name/internal tag carries +cpu), so torch cannot reach the CUDA runtime, and unsloth raises its "You need a GPU" error.
- The attempt to install a CUDA wheel via pip hit two problems: a version-dependency conflict with xformers, and pip ultimately installing a CPU wheel anyway (an index/version-matching issue, as explained above).
5. Actions taken (brief log)

- Manually terminated the process holding the VRAM (ollama.exe) to free it (done by the user during the session).
- Tried installing torch torchvision torchaudio (target cu121) in the venv; pip reported torch 2.9.0 installed, but it was still CPU-only.
- Added check-cuda.py for consistent diagnostics and recorded its output.
6. Recommended solutions and commands

Pick one of the two common strategies below.

Option A (recommended if you need xformers: keep it and choose a CUDA torch version compatible with it):

- Check the current Python version (it must match the wheel ABI):

.venv\Scripts\python.exe --version

- Look for a CUDA wheel compatible with the torch==2.8.0 that xformers requires (for example torch==2.8.0+cu121) and, if one exists, install it:

.venv\Scripts\python.exe -m pip uninstall -y torch torchvision torchaudio
.venv\Scripts\python.exe -m pip install "torch==2.8.0+cu121" torchvision==0.23.0+cu121 torchaudio==2.8.0+cu121 --extra-index-url https://download.pytorch.org/whl/cu121

Note: the version numbers are examples only. First pick, on the official PyTorch page, the exact wheel that matches your Python version and CUDA runtime (cu121). The sketch below prints the interpreter facts that drive wheel selection.
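To know which wheels can match at all, these are the facts pip compares against wheel tags (a small sketch):

import platform
import sys

# Wheel filenames encode these: a cp312 wheel only installs on Python 3.12,
# and a win_amd64 wheel only on 64-bit Windows.
print("python:", sys.version.split()[0])
print("abi tag: cp%d%d" % sys.version_info[:2])
print("platform:", platform.system(), platform.machine())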
Option B (simpler, if you do not need xformers: uninstall it first, then install the latest CUDA-enabled torch):

.venv\Scripts\Activate.ps1
# Optional: back up requirements, or record the current packages
pip uninstall -y xformers torch torchvision torchaudio
pip install --upgrade pip
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
# Verify
.venv\Scripts\python.exe -c "import torch; print(torch.__version__); print('cuda available:', torch.cuda.is_available()); print('devices:', torch.cuda.device_count())"

If the install succeeds, torch.cuda.is_available() should return True and torch.cuda.device_count() should be greater than 0; you can then run python .\main.py to confirm that unsloth finds an accelerator.
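Optionally, main.py can fail fast with a clearer message than unsloth's NotImplementedError. A minimal guard (my sketch):

import torch

# Check for a usable CUDA device before unsloth's import-time probe runs,
# so the failure message points straight at the torch install.
if not torch.cuda.is_available():
    raise SystemExit(
        f"torch {torch.__version__} sees no CUDA device "
        f"(compiled CUDA runtime: {torch.version.cuda}); "
        "reinstall a CUDA-enabled wheel before importing unsloth."
    )

from unsloth import FastLanguageModel  # safe once CUDA is visible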
Verification results

check-cuda.py run again after the reinstall:

C:\Users\admin\workspace\qwen-fintune\.venv\Lib\site-packages\torch\cuda\__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
{
  "python_executable": "C:\\Users\\admin\\workspace\\qwen-fintune\\.venv\\Scripts\\python.exe",
  "torch": {
    "torch_version": "2.5.1+cu121",
    "cuda_available": true,
    "torch_cuda_version": "12.1",
    "device_count": 1,
    "devices": [
      {
        "id": 0,
        "name": "Tesla P40",
        "total_memory_bytes": 25662586880,
        "multi_processor_count": 30
      }
    ]
  },
  "pynvml": {
    "gpus": [
      {
        "id": 0,
        "name": "Tesla P40",
        "memory_total_bytes": 25769803776,
        "memory_free_bytes": 25654001664,
        "memory_used_bytes": 115802112,
        "gpu_util_percent": 0,
        "memory_util_percent": 0
      }
    ]
  },
  "nvidia_smi": {
    "gpus": [
      {
        "index": 0,
        "name": "Tesla P40",
        "memory_total_MiB": 24576,
        "memory_free_MiB": 24465,
        "memory_used_MiB": 8,
        "util_gpu_percent": 0,
        "util_mem_percent": 0
      }
    ]
  }
}
7. Caveats and follow-up recommendations

- Keep the Python version matched to the ABI the PyTorch wheel supports (on Windows, some torch+CUDA wheels are only published for specific Python versions).
- Before a major upgrade of a deep-learning library (such as torch), back up or record the venv's dependencies first (pip freeze > requirements.txt).
- If your work depends on xformers or other libraries with strict torch version pins, pick the CUDA wheel around those constraints first, or decide within the team whether the pinned packages can be upgraded or replaced.
8. Appendix: files created/modified

check-cuda.py — new diagnostic script that collects torch, pynvml, and nvidia-smi information inside the venv (project root).
#!/usr/bin/env python3
"""check-cuda.py
Prints GPU/CUDA information using multiple fallbacks:
- torch (if installed)
- pynvml (if installed)
- nvidia-smi subprocess query (if available on PATH)
Output is JSON printed to stdout for easy parsing.
"""
import sys
import json
import subprocess
def gather_torch_info():
info = {}
try:
import torch
info['torch_version'] = torch.__version__
info['cuda_available'] = torch.cuda.is_available()
info['torch_cuda_version'] = getattr(torch.version, 'cuda', None)
device_count = torch.cuda.device_count() if hasattr(torch, 'cuda') else 0
info['device_count'] = device_count
devices = []
for i in range(device_count):
try:
prop = torch.cuda.get_device_properties(i)
name = prop.name
total_mem = getattr(prop, 'total_memory', None)
devices.append({
'id': i,
'name': name,
'total_memory_bytes': total_mem,
'multi_processor_count': getattr(prop, 'multi_processor_count', None),
})
except Exception as e:
devices.append({'id': i, 'error': str(e)})
info['devices'] = devices
except Exception as e:
info['error'] = str(e)
return info
def gather_pynvml_info():
info = {}
try:
import pynvml
try:
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
gpus = []
for i in range(count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
try:
name = pynvml.nvmlDeviceGetName(handle)
# name may be bytes on some systems
if isinstance(name, bytes):
name = name.decode(errors='ignore')
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
gpus.append({
'id': i,
'name': name,
'memory_total_bytes': int(mem.total),
'memory_free_bytes': int(mem.free),
'memory_used_bytes': int(mem.used),
'gpu_util_percent': int(util.gpu),
'memory_util_percent': int(util.memory),
})
except Exception as e:
gpus.append({'id': i, 'error': str(e)})
info['gpus'] = gpus
pynvml.nvmlShutdown()
except Exception as e:
info['nvml_error'] = str(e)
except Exception as e:
info['import_error'] = str(e)
return info
def gather_nvidia_smi():
info = {}
try:
cmd = [
'nvidia-smi',
'--query-gpu=index,name,memory.total,memory.free,memory.used,utilization.gpu,utilization.memory',
'--format=csv,nounits,noheader'
]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
rows = []
for line in proc.stdout.strip().splitlines():
parts = [p.strip() for p in line.split(',')]
if len(parts) >= 7:
try:
rows.append({
'index': int(parts[0]),
'name': parts[1],
'memory_total_MiB': int(parts[2]),
'memory_free_MiB': int(parts[3]),
'memory_used_MiB': int(parts[4]),
'util_gpu_percent': int(parts[5]),
'util_mem_percent': int(parts[6]),
})
except Exception:
rows.append({'raw': parts})
else:
rows.append({'raw': parts})
info['gpus'] = rows
except FileNotFoundError:
info['error'] = 'nvidia-smi not found on PATH'
except subprocess.CalledProcessError as e:
info['error'] = f'nvidia-smi failed: {e}'
except Exception as e:
info['error'] = str(e)
return info
def main():
result = {
'python_executable': sys.executable,
}
result['torch'] = gather_torch_info()
result['pynvml'] = gather_pynvml_info()
result['nvidia_smi'] = gather_nvidia_smi()
print(json.dumps(result, indent=2, ensure_ascii=False))
if __name__ == '__main__':
main()
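Typical usage (the report filename is my example): run the script with the venv interpreter and redirect the JSON, which goes to stdout while the pynvml deprecation warning goes to stderr:

.venv\Scripts\python.exe check-cuda.py > cuda-report.json

The saved report can then be consumed programmatically, e.g. to extract the one flag that gates unsloth (a sketch):

import json

# Read the saved report and print whether torch can see a CUDA device.
with open("cuda-report.json", encoding="utf-8") as f:
    report = json.load(f)
print("torch sees CUDA:", report["torch"].get("cuda_available", False))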