Update 02-Advanced-Model-Loading-and-Best-Practice.md

潘其威(William) 2023-05-12 19:47:05 +08:00 committed by GitHub
parent 393a2fbac2
commit 6f887f666a


@@ -59,7 +59,7 @@ In the simplest way, you can set `device_map='auto'` and let 🤗 Accelerate han
### At Quantization
It's always recommended to first consider loading the whole model into GPU(s), for this saves the time spent transferring module weights between CPU and GPU.
-However, not everyone has large GPU memory. Roughly speaking: always specify the maximum CPU memory that will be used to load the model; then, on each GPU, reserve enough memory to hold 1~2 model layers (2~3 for the first GPU in case CPU offload is used) for the example tensors and calculations during quantization, and use all the remaining memory to load model weights. With this, all you need is some simple math based on the number of GPUs you have, the size of the model weight file(s), and the number of model layers.
+However, not everyone has large GPU memory. Roughly speaking: always specify the maximum CPU memory that will be used to load the model; then, on each GPU, reserve enough memory to hold 1\~2 model layers (2\~3 for the first GPU in case CPU offload is used) for the example tensors and calculations during quantization, and use all the remaining memory to load model weights. With this, all you need is some simple math based on the number of GPUs you have, the size of the model weight file(s), and the number of model layers.
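To make the budgeting concrete, here is a minimal sketch of loading a model for quantization with a hand-built `max_memory` map. The model name, GPU sizes, and per-layer memory figures are illustrative assumptions, not values from this tutorial:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Assume two 24GiB GPUs and that one model layer fits in roughly 1GiB:
# reserve ~3GiB on GPU 0 (in case CPU offload is used) and ~2GiB on GPU 1,
# leaving the rest for model weights. All figures here are illustrative.
max_memory = {
    0: "21GiB",      # 24GiB minus ~3GiB reserved on the first GPU
    1: "22GiB",      # 24GiB minus ~2GiB reserved
    "cpu": "48GiB",  # maximum CPU memory used when loading the model
}

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(
    "facebook/opt-13b",  # hypothetical example model
    quantize_config,
    max_memory=max_memory,
)
```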
### At Inference
For inference, follow this principle: always use a single GPU if you can; otherwise use multiple GPUs; consider CPU offload only as a last resort.
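A hedged sketch of that order of preference; the checkpoint path and memory limits are placeholders:

```python
from auto_gptq import AutoGPTQForCausalLM

# Preferred: load the whole quantized model onto a single GPU.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model", device="cuda:0"
)

# Otherwise: spread the model across multiple GPUs by capping per-device
# memory; include "cpu" in the map only when offload is truly unavoidable.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",
    max_memory={0: "12GiB", 1: "12GiB", "cpu": "30GiB"},
)
```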