Press "Enter" to skip to content

Pinning the ChatGLM3 demo inference project to a specific GPU

I was trying to run the composite_demo from the ChatGLM3 project. My machine has two GPUs: a Tesla P40 with 24 GB of VRAM and an NVIDIA GeForce GTX 1060 with 6 GB.

When I launched the project, both cards appeared to be in use, so the default strategy is apparently to spread the model across every available GPU.

Since the 1060 does not have enough VRAM for this model, that strategy ends in an out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 5.93 GiB of which 1.50 MiB is free. Including non-PyTorch memory, this process has 5.89 GiB memory in use. Of the allocated memory 5.79 GiB is allocated by PyTorch, and 22.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Digging into the code, I found the following configuration in composite_demo/client.py:

            self.model = AutoModel.from_pretrained(
                model_path,
                trust_remote_code=True,
                config=config,
                device_map="auto"
            ).eval()

There is a device_map argument, currently set to "auto". That presumably means automatic placement, and by default it seems to use all of the GPUs.
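
For context, device_map="auto" delegates placement to the accelerate library, which splits the model's modules across whatever GPUs it can see according to their memory. Below is a minimal sketch of how to preview that split without loading any weights; the max_memory values are rough assumptions for my P40/1060 pair, not something taken from the ChatGLM3 docs.

# Sketch: preview the placement that device_map="auto" would compute,
# without loading the weights. Assumes transformers and accelerate are
# installed; the memory caps below are rough guesses for my two cards.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
with init_empty_weights():
    empty_model = AutoModel.from_config(config, trust_remote_code=True)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "22GiB", 1: "5GiB"},  # GPU 0: Tesla P40, GPU 1: GTX 1060
)
print(device_map)  # module name -> device index, i.e. how "auto" would split it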

I checked the official documentation:

https://github.com/THUDM/ChatGLM3/blob/main/DEPLOYMENT_en.md

Multi-GPU Deployment
If you have multiple GPUs, but each GPU's VRAM size is not enough to accommodate the complete model, then the model can be split across multiple GPUs. First, install accelerate: pip install accelerate, and then load the model through the following methods:

from utils import load_model_on_gpus

model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
This allows the model to be deployed on two GPUs for inference. You can change num_gpus to the number of GPUs you want to use. It is evenly split by default, but you can also pass the device_map parameter to specify it yourself.

So the GPU distribution can apparently be controlled through the device_map parameter, but the documentation does not spell out how to specify it.
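
From the transformers/accelerate docs, device_map can also be an explicit dict that maps module names to device indices. A hedged sketch of that form follows; the module names are my assumption about how the ChatGLM3 checkpoint is structured, so treat the keys printed by model.hf_device_map (shown just below) as the ground truth.

# Sketch: pass an explicit device_map instead of "auto". Keys are module names,
# values are GPU indices. The names here are assumptions for ChatGLM3 --
# use the exact keys that model.hf_device_map reports for your checkpoint.
from transformers import AutoModel

custom_device_map = {
    "transformer.embedding": 0,
    "transformer.rotary_pos_emb": 0,
    "transformer.encoder": 0,        # keep the whole transformer stack on the P40
    "transformer.output_layer": 0,
}
model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",
    trust_remote_code=True,
    device_map=custom_device_map,
).eval()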

Searching online, I found that model.hf_device_map can be used to inspect how a model is currently distributed across devices; it prints something like this:

{'transformer.wte': 0,
 'transformer.drop': 0,
 'transformer.h.0': 0,
 'transformer.h.1': 0,
 'transformer.h.2': 0,
 'transformer.h.3': 0,
 'transformer.h.4': 0,
 'transformer.h.5': 0,
 'transformer.h.6': 0,
 'transformer.h.7': 0,
 'transformer.h.8': 0,
 'transformer.h.9': 0,
 'transformer.h.10': 0,
 'transformer.h.11': 0,
 'transformer.h.12': 0,
 'transformer.h.13': 0,
 'transformer.h.14': 0,
 'transformer.h.15': 0,
 'transformer.h.16': 0,
 'transformer.h.17': 0,
 'transformer.h.18': 0,
 'transformer.h.19': 0,
 'transformer.h.20': 0,
 'transformer.h.21': 0,
 'transformer.h.22': 0,
 'transformer.h.23': 0,
 'transformer.h.24': 1,
 'transformer.h.25': 1,
 'transformer.h.26': 1,
 'transformer.h.27': 1,
 'transformer.ln_f': 1,
 'lm_head': 1}

Then, with model = load_checkpoint_and_dispatch(model, "sharded-gpt-j-6B", device_map=my_device_map), the weights can be loaded back and dispatched according to that map.
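
For completeness, here is a hedged sketch of that accelerate workflow; "sharded-gpt-j-6B" is the placeholder folder name from the accelerate docs, and my_device_map is abbreviated here, so in practice you would reuse the full dict printed by model.hf_device_map.

# Sketch: re-dispatch a sharded checkpoint with a saved device map.
# "sharded-gpt-j-6B" is the placeholder folder from the accelerate docs;
# my_device_map should be the full dict captured from model.hf_device_map.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "sharded-gpt-j-6B"
config = AutoConfig.from_pretrained(checkpoint)

with init_empty_weights():           # build the skeleton without allocating memory
    model = AutoModelForCausalLM.from_config(config)

my_device_map = {"transformer.wte": 0, "transformer.ln_f": 1, "lm_head": 1}  # abbreviated
model = load_checkpoint_and_dispatch(model, checkpoint, device_map=my_device_map)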

Later I came across another article that takes a more direct approach:

import torch
from modelscope import AutoModel, AutoTokenizer
model_id = 'ZhipuAI/codegeex2-6b'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, device_map={'': 'cuda:0'},  # or device_map='auto'
                                  torch_dtype=torch.bfloat16, trust_remote_code=True)
model = model.eval()
# remember to add a language tag for better performance
prompt = "# language: python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_length=256)
response = tokenizer.decode(outputs[0])
print(response)

As you can see, you can pass device_map={'': 'cuda:0'}, which runs the whole model directly on the first GPU.

The modified configuration in composite_demo/client.py:

            self.model = AutoModel.from_pretrained(
                model_path,
                trust_remote_code=True,
                config=config,
                # device_map="auto"
                device_map={'': 'cuda:0'}
            ).eval()
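
After this change, a quick sanity check (a sketch; run it once the model is loaded, using self.model inside client.py) confirms that nothing lands on the 1060:

# Sketch: verify placement after loading with device_map={'': 'cuda:0'}.
import torch

print(model.hf_device_map)  # expect everything mapped to cuda:0
print(f"GPU 0 allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"GPU 1 allocated: {torch.cuda.memory_allocated(1) / 2**30:.2f} GiB")  # should stay near 0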

References:

https://developer.aliyun.com/article/1287285

https://blog.csdn.net/Alex_StarSky/article/details/134188318

https://cloud.tencent.com/developer/article/2274903
