update tutorial

PanQiWei 2023-05-24 18:48:19 +08:00
parent ac14180946
commit c341a6df2f


@@ -4,13 +4,17 @@ Welcome to the tutorial of AutoGPTQ, in this chapter, you will learn advanced mo
## Arguments Introduction
In the previous chapter, you learned how to load a model onto CPU or a single GPU with the two basic APIs (a short recap sketch follows the list below):
- `.from_pretrained`: by default, loads the whole pretrained model into CPU memory.
- `.from_quantized`: by default, `auto_gptq` automatically finds a suitable way to load the quantized model:
  - if there is only a single GPU and the model fits into it, the whole model is loaded into that GPU;
  - if there are multiple GPUs and the model fits into them, the model is split evenly and loaded across those GPUs;
  - if the model can't fit into the GPU(s), CPU offloading is used.
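As a quick recap, a minimal sketch of the two entry points might look like the following; the model id, save directory, and quantization config are placeholders you should replace with your own:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# placeholder names; substitute your own model id and quantized model directory
pretrained_model_name = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

# .from_pretrained: loads the full-precision model (into CPU memory by default)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)

# .from_quantized: loads an already quantized model, choosing a device strategy automatically
quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir)
```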
However, the default settings above may not meet every user's demands, for some want more control over how the model is loaded.
Luckily, AutoGPTQ provides some advanced arguments that you can tweak to manually configure the model loading strategy:
- `low_cpu_mem_usage`: a `bool` type argument, defaults to `False`, can be used in both `.from_pretrained` and `.from_quantized`; enable it when CPU memory is limited (by default the model is initialized in CPU memory) or when you want to load the model faster (see the sketch after this list).
- `max_memory`: an optional `Dict[Union[int, str], str]` type argument, can be used in both `.from_pretrained` and `.from_quantized`.
- `device_map`: an optional `Union[str, Dict[str, Union[int, str]]]` type argument, currently only supported in `.from_quantized`.
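For example, a minimal sketch of turning on `low_cpu_mem_usage` when loading a quantized model could look like this (the directory name is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # placeholder path to a saved quantized model

# low_cpu_mem_usage lowers peak CPU memory during initialization and can speed up loading;
# per the list above, the same flag can also be passed to .from_pretrained
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    low_cpu_mem_usage=True,
)
```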
Before `auto-gptq` existed, many users had already quantized their models with other popular tools such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) and saved them under different names, without the `quantize_config.json` file introduced in the previous chapter.
@@ -50,9 +54,11 @@ max_memory = {0: "20GIB", "cpu": "20GIB"}
In this case, you can also load a model smaller than 40GB, but the remaining 20GB will be kept in CPU memory and only moved onto the GPU when needed.
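Concretely, passing such a `max_memory` dict might look like the sketch below; the directory and memory budgets are illustrative:

```python
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # illustrative path to a saved quantized model

# cap GPU 0 at 20GiB and allow up to 20GiB of CPU offloading;
# modules that don't fit on the GPU stay in CPU memory until they are needed
max_memory = {0: "20GIB", "cpu": "20GIB"}

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    max_memory=max_memory,
)
```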
### device_map
So far, only `.from_quantized` supports this argument.
You can provide a string to this argument to use a pre-set model loading strategy. The current valid values are `["auto", "balanced", "balanced_low_0", "sequential"]`.
In the simplest case, you can set `device_map='auto'` and let 🤗 Accelerate handle the device map computation. For more details of this argument, you can refer to [this document](https://huggingface.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map).
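As a sketch, letting 🤗 Accelerate compute the device map could look like this (again with a placeholder directory):

```python
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # placeholder path to a saved quantized model

# "auto" lets 🤗 Accelerate spread modules across the available GPUs and CPU;
# the other presets ("balanced", "balanced_low_0", "sequential") place modules differently
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device_map="auto",
)
```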
## Best Practice