update tutorial
This commit is contained in: parent ac14180946, commit c341a6df2f. 1 changed file with 12 additions and 6 deletions.
@@ -4,13 +4,17 @@ Welcome to the tutorial of AutoGPTQ, in this chapter, you will learn advanced mo

## Arguments Introduction

In the previous chapter, you learned how to load a model onto the CPU or a single GPU with the two basic APIs:
- `.from_pretrained`: by default, loads the whole pretrained model into CPU memory.
- `.from_quantized`: by default, `auto_gptq` will automatically find a suitable way to load the quantized model:
  - if there is only a single GPU and the model fits into it, the whole model will be loaded onto that GPU;
  - if there are multiple GPUs and the model fits into them, the model will be evenly split and loaded onto those GPUs;
  - if the model can't fit into the GPU(s), CPU offloading will be used.
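To make the two entry points concrete, below is a minimal sketch of both calls; the checkpoint names are placeholders rather than part of the tutorial, so substitute your own models.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "facebook/opt-125m"  # placeholder pretrained checkpoint
quantized_model_dir = "opt-125m-4bit"        # placeholder directory of a quantized model

# .from_pretrained: loads the full-precision model into CPU memory,
# ready to be quantized later with .quantize(...)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)

# .from_quantized: auto_gptq picks the placement (single GPU, multiple GPUs,
# or CPU offloading) based on the available memory
quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir)
```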
However, the default settings above may not meet many users' demands, for some want to have more control over model loading.
Luckily, in AutoGPTQ, we provide some advanced arguments that users can tweak to manually configure the model loading strategy:

- `low_cpu_mem_usage`: a `bool` type argument, defaults to `False`; can be used in both `.from_pretrained` and `.from_quantized`. One can enable it when CPU memory is limited (by default the model is initialized in CPU memory) or to load the model faster.
- `max_memory`: an optional `Dict[Union[str, int], str]` type argument; can be used in both `.from_pretrained` and `.from_quantized`.
- `device_map`: an optional `Union[str, Dict[str, Union[int, str]]]` type argument; currently only supported in `.from_quantized`.
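As a quick illustration of the first argument (`max_memory` and `device_map` are discussed in their own sections below), here is a sketch that simply follows the description above; the checkpoint names are placeholders and the exact keyword support may vary by version.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "facebook/opt-125m"  # placeholder
quantized_model_dir = "opt-125m-4bit"        # placeholder

# initialize weights lazily so less CPU memory is needed and loading is faster
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_name,
    BaseQuantizeConfig(bits=4, group_size=128),
    low_cpu_mem_usage=True,
)
quantized_model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    low_cpu_mem_usage=True,
)
```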
Before `auto-gptq`'s existence, many users had already used other popular tools such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) to quantize their models, saving them under different names and without the `quantize_config.json` file introduced in the previous chapter.
@@ -50,9 +54,11 @@ max_memory = {0: "20GIB", "cpu": "20GIB"}
In this case, you can also load a model smaller than 40GB; whatever does not fit into the GPU's 20GB will be kept in CPU memory and only moved onto the GPU when needed.
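As a sketch of that offloading setup, reusing the `max_memory` dict above (the model directory is a placeholder for any quantized model in the 20GB–40GB range):

```python
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "llama-30b-4bit"  # placeholder quantized model directory

# roughly 20GB of modules stay on GPU 0; the rest is kept in CPU memory
# and only moved onto the GPU when those modules are actually executed
max_memory = {0: "20GIB", "cpu": "20GIB"}
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, max_memory=max_memory)
```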
### device_map
So far, only `.from_quantized` supports this argument.
You can provide a string to this argument to use pre-set model loading strategies. Current valid values are `["auto", "balanced", "balanced_low_0", "sequential"]`.

In the simplest case, you can set `device_map='auto'` and let 🤗 Accelerate handle the device map computation. For more details on this argument, you can refer to [this document](https://huggingface.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map).
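A minimal sketch of the `device_map='auto'` case, again with a placeholder model directory:

```python
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # placeholder

# 🤗 Accelerate computes the device map across the available GPUs and the CPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device_map="auto")
```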
## Best Practice