Merge pull request #6838 from oobabooga/dev

Merge dev branch
oobabooga committed 2025-04-09 14:48:37 -03:00 (committed by GitHub)
commit 14e6baeb48 (GPG key ID: B5690EEEBB952194; no known key found for this signature in database)
24 changed files with 514 additions and 205 deletions


@ -12,7 +12,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
## Features
- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), and [ExLlamaV2](https://github.com/turboderp-org/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is supported via its own [Dockerfile](https://github.com/oobabooga/text-generation-webui/blob/main/docker/TensorRT-LLM/Dockerfile), and the Transformers loader is compatible with libraries like [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM), but they must be installed manually.
- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), [ExLlamaV3](https://github.com/turboderp-org/exllamav3), and [ExLlamaV2](https://github.com/turboderp-org/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is supported via its own [Dockerfile](https://github.com/oobabooga/text-generation-webui/blob/main/docker/TensorRT-LLM/Dockerfile), and the Transformers loader is compatible with libraries like [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM), but they must be installed manually.
- OpenAI-compatible API with Chat and Completions endpoints; see [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
- Automatic prompt formatting using Jinja2 templates (see the sketch after this list).
- Three chat modes: `instruct`, `chat-instruct`, and `chat`, with automatic prompt templates in `chat-instruct`.
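For illustration, Jinja2-based prompt formatting works along these lines (a minimal sketch; the template string is a stand-in, not the project's actual instruction template, which ships with each model):
```python
from jinja2 import Template

# Stand-in chat template; real templates are loaded per model.
template = Template(
    "{% for m in messages %}<|{{ m.role }}|>\n{{ m.content }}\n{% endfor %}<|assistant|>\n"
)
prompt = template.render(messages=[{"role": "user", "content": "Hello!"}])
print(prompt)
```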
@ -78,25 +78,19 @@ conda activate textgen
| System | GPU | Command |
|--------|---------|---------|
| Linux/WSL | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
| Linux/WSL | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/rocm6.1` |
| MacOS + MPS | Any | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
| Windows | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
| Windows | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
| Linux/WSL | NVIDIA | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
| Linux/WSL | CPU only | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.1` |
| MacOS + MPS | Any | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0` |
| Windows | NVIDIA | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
| Windows | CPU only | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0` |
The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
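After installing, a quick sanity check (a minimal sketch; the exact `+cu124`/`+rocm` suffix depends on the build you picked):
```python
import torch

print(torch.__version__)          # e.g. 2.6.0+cu124
print(torch.cuda.is_available())  # True on a working NVIDIA/ROCm setup
```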
For NVIDIA, you also need to install the CUDA runtime libraries:
If you need `nvcc` to compile some library manually, you will additionally need to install this:
```
conda install -y -c "nvidia/label/cuda-12.1.1" cuda-runtime
```
If you need `nvcc` to compile some library manually, replace the command above with
```
conda install -y -c "nvidia/label/cuda-12.1.1" cuda
conda install -y -c "nvidia/label/cuda-12.4.1" cuda
```
#### 3. Install the web UI
@ -143,19 +137,6 @@ Then browse to
3) Manually install AutoGPTQ: [Installation](https://github.com/PanQiWei/AutoGPTQ#install-from-source).
* Perform the from-source installation - there are no prebuilt ROCm packages for Windows.
##### Older NVIDIA GPUs
1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:
```
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
conda install -y -c "nvidia/label/cuda-11.8.0" cuda-runtime
```
2) bitsandbytes >= 0.39 may not work. In that case, to use `--load-in-8bit`, you may have to downgrade like this:
* Linux: `pip install bitsandbytes==0.38.1`
* Windows: `pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.38.1-py3-none-any.whl`
##### Manual install
The `requirements*.txt` above contain various wheels precompiled through GitHub Actions. If you wish to compile things manually, or if you need to because no suitable wheels are available for your hardware, you can use `requirements_nowheels.txt` and then install your desired loaders manually.

css/chat_style-Dark.css (new file, 128 lines)

@ -0,0 +1,128 @@
.message {
display: grid;
grid-template-columns: 60px minmax(0, 1fr);
padding-bottom: 28px;
font-size: 18px;
font-family: Roboto, Arial, sans-serif; /* Modern font */
line-height: 1.5;
}
.circle-you,
.circle-bot {
background-color: #2b2b2b; /* Darker background for circles */
border-radius: 50%; /* Perfect circle */
border: 1px solid #4a90e2; /* Soft blue border */
box-shadow: 0 4px 8px rgb(0 0 0 / 50%); /* Soft shadow for depth */
}
.circle-bot img,
.circle-you img {
border-radius: 50%; /* Make images circular */
width: 100%;
height: 100%;
object-fit: cover;
}
.circle-you, .circle-bot {
width: 64px; /* Smaller size for modern look */
height: 64px;
}
.text {
padding-left: 12px; /* Reduced padding for a cleaner layout */
color: #f0f0f0; /* Light text color for readability */
}
.text p {
margin-top: 2px;
}
.username {
padding-left: 10px;
font-size: 20px;
font-weight: bold;
color: #e0e0e0; /* Light gray text */
transition: color 0.3s ease; /* Smooth color transition */
}
.username:hover {
color: #4a90e2; /* Blue color on hover */
}
.message-body {
position: relative;
border: 1px solid rgb(255 255 255 / 10%); /* Soft white border */
border-radius: 8px; /* Slightly rounded corners */
padding: 15px;
background: #1e1e1e; /* Dark background */
box-shadow: 0 4px 10px rgb(0 0 0 / 30%); /* Subtle shadow for depth */
transition: background 0.3s ease; /* Smooth transition for background */
}
.message-body:hover {
background: #252525; /* Slightly lighter on hover */
}
/* Adds 2 extra lines at the top and bottom of the message */
.message-body::before,
.message-body::after {
content: "";
position: absolute;
left: 10px;
right: 10px;
height: 1px;
background-color: rgb(255 255 255 / 5%); /* Faded lines for subtle separation */
}
.message-body::before {
top: 4px;
}
.message-body::after {
bottom: 4px;
}
.message-body img {
max-width: 300px;
max-height: 300px;
border-radius: 10px; /* Rounded corners for images */
}
.message-body p {
margin-bottom: 0 !important;
font-size: 16px !important;
line-height: 1.5 !important;
color: #e0e0e0 !important; /* Light color for text */
}
.message-body p em {
color: #a6a6a6 !important; /* Softer gray for emphasized text */
}
@media screen and (width <= 688px) {
.message {
display: grid;
grid-template-columns: 60px minmax(0, 1fr);
padding-bottom: 25px;
font-size: 15px;
font-family: Roboto, Arial, sans-serif; /* Modern font */
line-height: 1.5;
}
.circle-you, .circle-bot {
width: 40px; /* Smaller size for mobile */
height: 40px;
}
.text {
padding-left: 10px; /* Reduced padding for mobile */
}
.message-body p {
font-size: 14px !important; /* Smaller text for mobile */
}
.username {
font-size: 18px; /* Smaller username for mobile */
}
}


@ -13,7 +13,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,rw \
WORKDIR /home/app/
RUN git clone https://github.com/oobabooga/text-generation-webui.git
WORKDIR /home/app/text-generation-webui
RUN GPU_CHOICE=C LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
RUN GPU_CHOICE=B LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
COPY CMD_FLAGS.txt /home/app/text-generation-webui/
EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
WORKDIR /home/app/text-generation-webui


@ -13,7 +13,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,rw \
WORKDIR /home/app/
RUN git clone https://github.com/oobabooga/text-generation-webui.git
WORKDIR /home/app/text-generation-webui
RUN GPU_CHOICE=E LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
RUN GPU_CHOICE=D LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
COPY CMD_FLAGS.txt /home/app/text-generation-webui/
EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
# set umask to ensure group read / write at runtime


@ -9,9 +9,9 @@ the `settings.json` file, see the Examples below. Retrieve your authtoken on the
# Documentation
For a list of all available options, see [the configuration documentation](https://ngrok.com/docs/ngrok-agent/config/) or [the connect example](https://github.com/ngrok/ngrok-py/blob/main/examples/ngrok-connect-full.py).
For a list of all available options, see [the configuration documentation](https://ngrok.com/docs/ngrok-agent/config/) or [the forward example](https://github.com/ngrok/ngrok-python/blob/main/examples/ngrok-forward-full.py).
The ngrok Python SDK is [on github here](https://github.com/ngrok/ngrok-py). A quickstart guide and a full API reference are included in the [ngrok-py Python API documentation](https://ngrok.github.io/ngrok-py/).
The ngrok Python SDK is [on github here](https://github.com/ngrok/ngrok-py). A quickstart guide and a full API reference are included in the [ngrok-py Python API documentation](https://ngrok.github.io/ngrok-python/).
# Running
@ -66,4 +66,4 @@ To add an authtoken instead of using the NGROK_AUTHTOKEN environment variable:
"authtoken_from_env":false
}
}
```
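For reference, a minimal forward call with the ngrok Python SDK looks roughly like this (a sketch based on the linked ngrok-python examples; check the SDK docs for current signatures):
```python
import ngrok

# Assumes NGROK_AUTHTOKEN is set in the environment.
listener = ngrok.forward(7860, authtoken_from_env=True)
print(f"Web UI tunneled at: {listener.url()}")
```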

modules/exllamav3_hf.py (new file, 179 lines)

@ -0,0 +1,179 @@
import os
import traceback
from pathlib import Path
from typing import Any, Dict, Optional, Union
import torch
from exllamav3 import Cache, Config, Model
from torch.nn import CrossEntropyLoss
from transformers import GenerationConfig, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast
from modules import shared
from modules.logging_colors import logger
try:
import flash_attn
except Exception:
logger.warning('Failed to load flash-attention due to the following error:\n')
traceback.print_exc()
class Exllamav3HF(PreTrainedModel):
def __init__(self, model_dir):
super().__init__(PretrainedConfig())
self.generation_config = GenerationConfig()
config = Config.from_directory(model_dir)
self.ex_model = Model.from_config(config)
# Calculate the closest multiple of 256 at or above the chosen value
max_tokens = shared.args.max_seq_len
if max_tokens % 256 != 0:
adjusted_tokens = ((max_tokens // 256) + 1) * 256
logger.warning(f"max_num_tokens must be a multiple of 256. Adjusting from {max_tokens} to {adjusted_tokens}")
max_tokens = adjusted_tokens
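# e.g. --max_seq_len 10000 is rounded up to 10240; 8192 is already a multiple of 256 and is kept as-is.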
self.ex_cache = Cache(self.ex_model, max_num_tokens=max_tokens)
# Create load parameters dictionary
load_params = {'progressbar': True}
if shared.args.gpu_split:
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
load_params['use_per_device'] = split
self.ex_model.load(**load_params)
self.past_seq = None
self.max_tokens = max_tokens
def _validate_model_class(self):
pass
def _validate_model_kwargs(self, model_kwargs: Dict[str, Any]):
pass
def prepare_inputs_for_generation(self, input_ids, **kwargs):
return {'input_ids': input_ids, **kwargs}
@property
def device(self) -> torch.device:
return torch.device(0)
def __call__(self, *args, **kwargs):
use_cache = kwargs.get('use_cache', True)
labels = kwargs.get('labels', None)
past_key_values = kwargs.get('past_key_values', None)
if len(args) > 0:
if not shared.args.cfg_cache:
logger.error("Please enable the cfg-cache option to use CFG with ExLlamav3_HF.")
return
input_ids = args[0]
is_negative = True
past_seq = self.past_seq_negative
ex_cache = self.ex_cache_negative
else:
input_ids = kwargs['input_ids']
is_negative = False
past_seq = self.past_seq
ex_cache = self.ex_cache
seq = input_ids[0].tolist()
if is_negative and past_key_values is not None:
seq = past_key_values + seq
seq_tensor = torch.tensor(seq)
reset = True
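# Reuse the previous cache where possible: forward only the tokens after the longest shared prefix; if the sequences diverge immediately, reset and re-ingest everything.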
# Make the forward call
if labels is None:
if past_seq is not None:
min_length = min(past_seq.shape[0], seq_tensor.shape[0])
indices = torch.nonzero(~torch.eq(past_seq[:min_length], seq_tensor[:min_length]))
if len(indices) > 0:
longest_prefix = indices[0].item()
else:
longest_prefix = min_length
if longest_prefix > 0:
reset = False
current_len = longest_prefix
if len(seq_tensor) - longest_prefix > 1:
self.ex_model.forward(
input_ids=seq_tensor[longest_prefix:-1].view(1, -1),
params={
"attn_mode": "flash_attn",
"cache": ex_cache,
"past_len": longest_prefix,
"batch_shape": (1, self.max_tokens)
}
)
current_len = longest_prefix + len(seq_tensor) - longest_prefix - 1
if reset:
if len(seq_tensor) > 1:
self.ex_model.forward(
input_ids=seq_tensor[:-1].view(1, -1),
params={
"attn_mode": "flash_attn",
"cache": ex_cache,
"past_len": 0,
"batch_shape": (1, self.max_tokens)
}
)
current_len = len(seq_tensor) - 1
else:
current_len = 0
logits = self.ex_model.forward(
input_ids=seq_tensor[-1:].view(1, -1),
params={
"attn_mode": "flash_attn",
"cache": ex_cache,
"past_len": current_len,
"batch_shape": (1, self.max_tokens)
}
).to(input_ids.device).float()
else:
logits = self.ex_model.forward(
input_ids=seq_tensor.view(1, -1),
params={
"attn_mode": "flash_attn",
"cache": ex_cache,
"past_len": 0,
"batch_shape": (1, self.max_tokens)
}
).float()
if is_negative:
self.past_seq_negative = seq_tensor
else:
self.past_seq = seq_tensor
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, logits.shape[-1])
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
return CausalLMOutputWithPast(logits=logits, past_key_values=seq if use_cache else None, loss=loss)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs):
assert len(model_args) == 0 and len(kwargs) == 0, "extra args is currently not supported"
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
pretrained_model_name_or_path = Path(f'{shared.args.model_dir}') / Path(pretrained_model_name_or_path)
return Exllamav3HF(pretrained_model_name_or_path)
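A hypothetical usage sketch for the new wrapper (the model name is a placeholder; the tokenizer comes from the model directory, as with the other *_HF loaders):
```python
from pathlib import Path

from transformers import AutoTokenizer

from modules import shared
from modules.exllamav3_hf import Exllamav3HF

# "MyModel-exl3" is a placeholder folder under shared.args.model_dir.
model = Exllamav3HF.from_pretrained("MyModel-exl3")
tokenizer = AutoTokenizer.from_pretrained(Path(shared.args.model_dir) / "MyModel-exl3")

input_ids = tokenizer("Hello,", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```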


@ -23,7 +23,6 @@ loaders_and_params = OrderedDict({
'use_double_quant',
'use_eager_attention',
'bf16',
'trust_remote_code',
'no_use_fast',
],
@ -76,6 +75,13 @@ loaders_and_params = OrderedDict({
'no_use_fast',
'llamacpp_HF_info',
],
'ExLlamav3_HF': [
'max_seq_len',
'gpu_split',
'cfg_cache',
'trust_remote_code',
'no_use_fast',
],
'ExLlamav2_HF': [
'max_seq_len',
'cache_type',
@ -174,30 +180,38 @@ def transformers_samplers():
loaders_samplers = {
'Transformers': transformers_samplers(),
'HQQ': transformers_samplers(),
'ExLlamav2': {
'ExLlamav3_HF': {
'temperature',
'dynatemp_low',
'dynatemp_high',
'dynatemp_exponent',
'smoothing_factor',
'smoothing_curve',
'min_p',
'top_p',
'top_k',
'typical_p',
'xtc_threshold',
'xtc_probability',
'epsilon_cutoff',
'eta_cutoff',
'tfs',
'top_a',
'top_n_sigma',
'dry_multiplier',
'dry_allowed_length',
'dry_base',
'repetition_penalty',
'frequency_penalty',
'presence_penalty',
'encoder_repetition_penalty',
'no_repeat_ngram_size',
'repetition_penalty_range',
'guidance_scale',
'mirostat_mode',
'mirostat_tau',
'mirostat_eta',
'do_sample',
'dynamic_temperature',
'temperature_last',
'auto_max_new_tokens',
@ -205,8 +219,12 @@ loaders_samplers = {
'add_bos_token',
'skip_special_tokens',
'seed',
'sampler_priority',
'custom_token_bans',
'negative_prompt',
'dry_sequence_breakers',
'grammar_string',
'grammar_file_row',
},
'ExLlamav2_HF': {
'temperature',
@ -254,6 +272,40 @@ loaders_samplers = {
'grammar_string',
'grammar_file_row',
},
'ExLlamav2': {
'temperature',
'dynatemp_low',
'dynatemp_high',
'dynatemp_exponent',
'smoothing_factor',
'min_p',
'top_p',
'top_k',
'typical_p',
'xtc_threshold',
'xtc_probability',
'tfs',
'top_a',
'dry_multiplier',
'dry_allowed_length',
'dry_base',
'repetition_penalty',
'frequency_penalty',
'presence_penalty',
'repetition_penalty_range',
'mirostat_mode',
'mirostat_tau',
'mirostat_eta',
'dynamic_temperature',
'temperature_last',
'auto_max_new_tokens',
'ban_eos_token',
'add_bos_token',
'skip_special_tokens',
'seed',
'custom_token_bans',
'dry_sequence_breakers',
},
'llama.cpp': {
'temperature',
'min_p',


@ -69,8 +69,9 @@ def load_model(model_name, loader=None):
'Transformers': huggingface_loader,
'llama.cpp': llamacpp_loader,
'llamacpp_HF': llamacpp_HF_loader,
'ExLlamav2': ExLlamav2_loader,
'ExLlamav3_HF': ExLlamav3_HF_loader,
'ExLlamav2_HF': ExLlamav2_HF_loader,
'ExLlamav2': ExLlamav2_loader,
'HQQ': HQQ_loader,
'TensorRT-LLM': TensorRT_LLM_loader,
}
@ -304,11 +305,10 @@ def llamacpp_HF_loader(model_name):
return model
def ExLlamav2_loader(model_name):
from modules.exllamav2 import Exllamav2Model
def ExLlamav3_HF_loader(model_name):
from modules.exllamav3_hf import Exllamav3HF
model, tokenizer = Exllamav2Model.from_pretrained(model_name)
return model, tokenizer
return Exllamav3HF.from_pretrained(model_name)
def ExLlamav2_HF_loader(model_name):
@ -317,6 +317,13 @@ def ExLlamav2_HF_loader(model_name):
return Exllamav2HF.from_pretrained(model_name)
def ExLlamav2_loader(model_name):
from modules.exllamav2 import Exllamav2Model
model, tokenizer = Exllamav2Model.from_pretrained(model_name)
return model, tokenizer
def HQQ_loader(model_name):
try:
from hqq.core.quantize import HQQBackend, HQQLinear


@ -17,6 +17,7 @@ def get_fallback_settings():
'compress_pos_emb': 1,
'alpha_value': 1,
'truncation_length': shared.settings['truncation_length'],
'truncation_length_info': shared.settings['truncation_length'],
'skip_special_tokens': shared.settings['skip_special_tokens'],
'custom_stopping_strings': shared.settings['custom_stopping_strings'],
}
@ -53,7 +54,8 @@ def get_model_metadata(model):
for k in metadata:
if k.endswith('context_length'):
model_settings['n_ctx'] = metadata[k]
model_settings['n_ctx'] = min(metadata[k], 8192)
model_settings['truncation_length_info'] = metadata[k]
elif k.endswith('rope.freq_base'):
model_settings['rope_freq_base'] = metadata[k]
elif k.endswith('rope.scale_linear'):
@ -89,7 +91,8 @@ def get_model_metadata(model):
for k in ['max_position_embeddings', 'model_max_length', 'max_seq_len']:
if k in metadata:
model_settings['truncation_length'] = metadata[k]
model_settings['max_seq_len'] = metadata[k]
model_settings['truncation_length_info'] = metadata[k]
model_settings['max_seq_len'] = min(metadata[k], 8192)
if 'rope_theta' in metadata:
model_settings['rope_freq_base'] = metadata['rope_theta']
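The effect of the two `min(..., 8192)` changes above: the default context settings are capped while the full metadata value is preserved for display. Roughly (values assumed for illustration):
```python
advertised = 131072  # assumed context length from the model's metadata

max_seq_len = min(advertised, 8192)   # what the loader starts with
truncation_length_info = advertised   # what the UI reports as the model's maximum
```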
@ -155,14 +158,14 @@ def infer_loader(model_name, model_settings):
path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
if not path_to_model.exists():
loader = None
elif (path_to_model / 'quantize_config.json').exists(): # Old GPTQ metadata file
loader = 'ExLlamav2_HF'
elif len(list(path_to_model.glob('*.gguf'))) > 0 and path_to_model.is_dir() and (path_to_model / 'tokenizer_config.json').exists():
loader = 'llamacpp_HF'
elif len(list(path_to_model.glob('*.gguf'))) > 0:
loader = 'llama.cpp'
elif re.match(r'.*\.gguf', model_name.lower()):
loader = 'llama.cpp'
elif re.match(r'.*exl3', model_name.lower()):
loader = 'ExLlamav3_HF'
elif re.match(r'.*exl2', model_name.lower()):
loader = 'ExLlamav2_HF'
elif re.match(r'.*-hqq', model_name.lower()):
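With the new `exl3` branch, autodetection follows the folder name; a rough illustration with hypothetical model names, mirroring the order of the checks above:
```python
import re

for name in ["model.gguf", "Llama-3-8B-exl3", "Llama-3-8B-exl2"]:
    if re.match(r'.*\.gguf', name.lower()):
        print(name, "-> llama.cpp")
    elif re.match(r'.*exl3', name.lower()):
        print(name, "-> ExLlamav3_HF")
    elif re.match(r'.*exl2', name.lower()):
        print(name, "-> ExLlamav2_HF")
```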


@ -53,7 +53,7 @@ settings = {
'skip_special_tokens': True,
'stream': True,
'static_cache': False,
'truncation_length': 2048,
'truncation_length': 8192,
'seed': -1,
'custom_stopping_strings': '',
'custom_token_bans': '',
@ -79,7 +79,6 @@ group.add_argument('--model', type=str, help='Name of the model to load by defau
group.add_argument('--lora', type=str, nargs='+', help='The list of LoRAs to load. If you want to load more than one LoRA, write the names separated by spaces.')
group.add_argument('--model-dir', type=str, default='models/', help='Path to directory with all the models.')
group.add_argument('--lora-dir', type=str, default='loras/', help='Path to directory with all the loras.')
group.add_argument('--model-menu', action='store_true', help='Show a model menu in the terminal when the web UI is first launched.')
group.add_argument('--settings', type=str, help='Load the default interface settings from this yaml file. See settings-template.yaml for an example. If you create a file called settings.yaml, this file will be loaded by default without the need to use the --settings flag.')
group.add_argument('--extensions', type=str, nargs='+', help='The list of extensions to load. If you want to load more than one extension, write the names separated by spaces.')
group.add_argument('--verbose', action='store_true', help='Print the prompts to the terminal.')
@ -87,7 +86,7 @@ group.add_argument('--idle-timeout', type=int, default=0, help='Unload model aft
# Model loader
group = parser.add_argument_group('Model loader')
group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, HQQ, TensorRT-LLM.')
group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav3_HF, ExLlamav2_HF, ExLlamav2, HQQ, TensorRT-LLM.')
# Transformers/Accelerate
group = parser.add_argument_group('Transformers/Accelerate')
@ -118,7 +117,7 @@ group.add_argument('--quant_type', type=str, default='nf4', help='quant_type for
group = parser.add_argument_group('llama.cpp')
group.add_argument('--flash-attn', action='store_true', help='Use flash-attention.')
group.add_argument('--tensorcores', action='store_true', help='NVIDIA only: use llama-cpp-python compiled without GGML_CUDA_FORCE_MMQ. This may improve performance on newer cards.')
group.add_argument('--n_ctx', type=int, default=2048, help='Size of the prompt context.')
group.add_argument('--n_ctx', type=int, default=8192, help='Size of the prompt context.')
group.add_argument('--threads', type=int, default=0, help='Number of threads to use.')
group.add_argument('--threads-batch', type=int, default=0, help='Number of threads to use for batches/prompt processing.')
group.add_argument('--no_mul_mat_q', action='store_true', help='Disable the mulmat kernels.')
@ -140,7 +139,7 @@ group.add_argument('--tokenizer-dir', type=str, help='Load the tokenizer from th
group = parser.add_argument_group('ExLlamaV2')
group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.')
group.add_argument('--autosplit', action='store_true', help='Autosplit the model tensors across the available GPUs. This causes --gpu-split to be ignored.')
group.add_argument('--max_seq_len', type=int, default=2048, help='Maximum sequence length.')
group.add_argument('--max_seq_len', type=int, default=8192, help='Maximum sequence length.')
group.add_argument('--cfg-cache', action='store_true', help='ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader.')
group.add_argument('--no_flash_attn', action='store_true', help='Force flash-attention to not be used.')
group.add_argument('--no_xformers', action='store_true', help='Force xformers to not be used.')
@ -215,6 +214,7 @@ group.add_argument('--disable_exllama', action='store_true', help='DEPRECATED')
group.add_argument('--disable_exllamav2', action='store_true', help='DEPRECATED')
group.add_argument('--wbits', type=int, default=0, help='DEPRECATED')
group.add_argument('--groupsize', type=int, default=-1, help='DEPRECATED')
group.add_argument('--model-menu', action='store_true', help='DEPRECATED')
args = parser.parse_args()
args_defaults = parser.parse_args([])
@ -273,6 +273,8 @@ def fix_loader_name(name):
return 'ExLlamav2'
elif name in ['exllamav2-hf', 'exllamav2_hf', 'exllama-v2-hf', 'exllama_v2_hf', 'exllama-v2_hf', 'exllama2-hf', 'exllama2_hf', 'exllama-2-hf', 'exllama_2_hf', 'exllama-2_hf']:
return 'ExLlamav2_HF'
elif name in ['exllamav3-hf', 'exllamav3_hf', 'exllama-v3-hf', 'exllama_v3_hf', 'exllama-v3_hf', 'exllama3-hf', 'exllama3_hf', 'exllama-3-hf', 'exllama_3_hf', 'exllama-3_hf']:
return 'ExLlamav3_HF'
elif name in ['hqq']:
return 'HQQ'
elif name in ['tensorrt', 'tensorrtllm', 'tensorrt_llm', 'tensorrt-llm', 'tensort', 'tensortllm']:


@ -200,8 +200,10 @@ def create_event_handlers():
def load_model_wrapper(selected_model, loader, autoload=False):
settings = get_model_metadata(selected_model)
if not autoload:
yield f"The settings for `{selected_model}` have been updated.\n\nClick on \"Load\" to load it."
yield "### {}\n\n- Settings updated: Click \"Load\" to load the model\n- Max sequence length: {}".format(selected_model, settings['truncation_length_info'])
return
if selected_model == 'None':
@ -214,13 +216,7 @@ def load_model_wrapper(selected_model, loader, autoload=False):
shared.model, shared.tokenizer = load_model(selected_model, loader)
if shared.model is not None:
output = f"Successfully loaded `{selected_model}`."
settings = get_model_metadata(selected_model)
if 'instruction_template' in settings:
output += '\n\nIt seems to be an instruction-following model with template "{}". In the chat tab, instruct or chat-instruct modes should be used.'.format(settings['instruction_template'])
yield output
yield f"Successfully loaded `{selected_model}`."
else:
yield f"Failed to load `{selected_model}`."
except:


@ -87,7 +87,7 @@ def create_ui(default_preset):
shared.gradio['static_cache'] = gr.Checkbox(value=shared.settings['static_cache'], label='Static KV cache', info='Use a static cache for improved performance.')
with gr.Column():
shared.gradio['truncation_length'] = gr.Number(precision=0, step=256, value=get_truncation_length(), label='Truncate the prompt up to this length', info='The leftmost tokens are removed if the prompt exceeds this length. Most models require this to be at most 2048.')
shared.gradio['truncation_length'] = gr.Number(precision=0, step=256, value=get_truncation_length(), label='Truncate the prompt up to this length', info='The leftmost tokens are removed if the prompt exceeds this length.')
shared.gradio['seed'] = gr.Number(value=shared.settings['seed'], label='Seed (-1 for random)')
shared.gradio['sampler_priority'] = gr.Textbox(value=generate_params['sampler_priority'], lines=12, label='Sampler priority', info='Parameter names separated by new lines or commas.', elem_classes=['add_scrollbar'])


@ -16,10 +16,12 @@ import sys
# os.environ["HCC_AMDGPU_TARGET"] = 'gfx1030'
# Define the required PyTorch version
TORCH_VERSION = "2.4.1"
TORCHVISION_VERSION = "0.19.1"
TORCHAUDIO_VERSION = "2.4.1"
# Define the required versions
TORCH_VERSION = "2.6.0"
TORCHVISION_VERSION = "0.21.0"
TORCHAUDIO_VERSION = "2.6.0"
PYTHON_VERSION = "3.11"
LIBSTDCXX_VERSION_LINUX = "12.1.0"
# Environment
script_dir = os.getcwd()
@ -101,15 +103,20 @@ def torch_version():
return torver
def update_pytorch():
def update_pytorch_and_python():
print_big_message("Checking for PyTorch updates.")
# Update the Python version. Left here for future reference in case this becomes necessary.
# print_big_message("Checking for PyTorch and Python updates.")
# current_python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
# if current_python_version != PYTHON_VERSION:
# run_cmd(f"conda install -y python={PYTHON_VERSION}", assert_success=True, environment=True)
torver = torch_version()
base_cmd = f"python -m pip install --upgrade torch=={TORCH_VERSION} torchvision=={TORCHVISION_VERSION} torchaudio=={TORCHAUDIO_VERSION}"
if "+cu118" in torver:
install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu118"
elif "+cu" in torver:
install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu121"
if "+cu" in torver:
install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu124"
elif "+rocm" in torver:
install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/rocm6.1"
elif "+cpu" in torver:
@ -236,24 +243,21 @@ def install_webui():
choice = os.environ["GPU_CHOICE"].upper()
print_big_message(f"Selected GPU choice \"{choice}\" based on the GPU_CHOICE environment variable.")
# Warn about changed meanings and handle old NVIDIA choice
# Warn about changed meanings and handle old choices
if choice == "B":
print_big_message("Warning: GPU_CHOICE='B' now means 'NVIDIA (CUDA 11.8)' in the new version.")
print_big_message("Warning: GPU_CHOICE='B' now means 'AMD' in the new version.")
elif choice == "C":
print_big_message("Warning: GPU_CHOICE='C' now means 'AMD' in the new version.")
print_big_message("Warning: GPU_CHOICE='C' now means 'Apple M Series' in the new version.")
elif choice == "D":
print_big_message("Warning: GPU_CHOICE='D' now means 'Apple M Series' in the new version.")
elif choice == "A" and "USE_CUDA118" in os.environ:
choice = "B" if os.environ.get("USE_CUDA118", "").lower() in ("yes", "y", "true", "1", "t", "on") else "A"
print_big_message("Warning: GPU_CHOICE='D' now means 'Intel Arc' in the new version.")
else:
choice = get_user_choice(
"What is your GPU?",
{
'A': 'NVIDIA - CUDA 12.1 (recommended)',
'B': 'NVIDIA - CUDA 11.8 (legacy GPUs)',
'C': 'AMD - Linux/macOS only, requires ROCm 6.1',
'D': 'Apple M Series',
'E': 'Intel Arc (beta)',
'A': 'NVIDIA - CUDA 12.4',
'B': 'AMD - Linux/macOS only, requires ROCm 6.1',
'C': 'Apple M Series',
'D': 'Intel Arc (beta)',
'N': 'CPU mode'
},
)
@ -261,15 +265,13 @@ def install_webui():
# Convert choices to GPU names for compatibility
gpu_choice_to_name = {
"A": "NVIDIA",
"B": "NVIDIA",
"C": "AMD",
"D": "APPLE",
"E": "INTEL",
"B": "AMD",
"C": "APPLE",
"D": "INTEL",
"N": "NONE"
}
selected_gpu = gpu_choice_to_name[choice]
use_cuda118 = (choice == "B") # CUDA version is now determined by menu choice
# Write a flag to CMD_FLAGS.txt for CPU mode
if selected_gpu == "NONE":
@ -280,10 +282,7 @@ def install_webui():
# Handle CUDA version display
elif any((is_windows(), is_linux())) and selected_gpu == "NVIDIA":
if use_cuda118:
print("CUDA: 11.8")
else:
print("CUDA: 12.1")
print("CUDA: 12.4")
# No PyTorch for AMD on Windows (?)
elif is_windows() and selected_gpu == "AMD":
@ -294,10 +293,7 @@ def install_webui():
install_pytorch = f"python -m pip install torch=={TORCH_VERSION} torchvision=={TORCHVISION_VERSION} torchaudio=={TORCHAUDIO_VERSION} "
if selected_gpu == "NVIDIA":
if use_cuda118 == 'Y':
install_pytorch += "--index-url https://download.pytorch.org/whl/cu118"
else:
install_pytorch += "--index-url https://download.pytorch.org/whl/cu121"
install_pytorch += "--index-url https://download.pytorch.org/whl/cu124"
elif selected_gpu == "AMD":
install_pytorch += "--index-url https://download.pytorch.org/whl/rocm6.1"
elif selected_gpu in ["APPLE", "NONE"]:
@ -310,14 +306,14 @@ def install_webui():
# Install Git and then Pytorch
print_big_message("Installing PyTorch.")
run_cmd(f"conda install -y -k ninja git && {install_pytorch} && python -m pip install py-cpuinfo==9.0.0", assert_success=True, environment=True)
run_cmd(f"conda install -y ninja git && {install_pytorch} && python -m pip install py-cpuinfo==9.0.0", assert_success=True, environment=True)
if selected_gpu == "INTEL":
# Install oneAPI dependencies via conda
print_big_message("Installing Intel oneAPI runtime libraries.")
run_cmd("conda install -y -c https://software.repos.intel.com/python/conda/ -c conda-forge dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0")
run_cmd("conda install -y -c https://software.repos.intel.com/python/conda/ -c conda-forge dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0", environment=True)
# Install libuv required by Intel-patched torch
run_cmd("conda install -y libuv")
run_cmd("conda install -y libuv", environment=True)
# Install the webui requirements
update_requirements(initial_installation=True, pull=False)
@ -336,6 +332,24 @@ def install_extensions_requirements():
run_cmd(f"python -m pip install -r {extension_req_path} --upgrade", assert_success=False, environment=True)
def clean_outdated_pytorch_cuda_dependencies():
patterns = ["cu121", "cu122", "torch2.4"]
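# Version strings such as "0.3.8+cu121" or "2.7.3+cu122torch2.4.1" identify wheels built against the old CUDA/torch stack; they are uninstalled so the cu124/torch 2.6 builds can replace them.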
result = run_cmd("python -m pip list --format=freeze", capture_output=True, environment=True)
matching_packages = []
for line in result.stdout.decode('utf-8').splitlines():
if "==" in line:
pkg_name, version = line.split('==', 1)
if any(pattern in version for pattern in patterns):
matching_packages.append(pkg_name)
if matching_packages:
print(f"\nUninstalling: {', '.join(matching_packages)}\n")
run_cmd(f"python -m pip uninstall -y {' '.join(matching_packages)}", assert_success=True, environment=True)
return matching_packages
def update_requirements(initial_installation=False, pull=True):
# Create .git directory if missing
if not os.path.exists(os.path.join(script_dir, ".git")):
@ -421,9 +435,14 @@ def update_requirements(initial_installation=False, pull=True):
if os.environ.get("INSTALL_EXTENSIONS", "").lower() in ("yes", "y", "true", "1", "t", "on"):
install_extensions_requirements()
if is_linux():
run_cmd(f"conda install -y -c conda-forge libstdcxx-ng=={LIBSTDCXX_VERSION_LINUX}", assert_success=True, environment=True)
# Update PyTorch
if not initial_installation:
update_pytorch()
update_pytorch_and_python()
torver = torch_version()
clean_outdated_pytorch_cuda_dependencies()
print_big_message(f"Installing webui requirements from file: {requirements_file}")
print(f"TORCH: {torver}\n")
@ -434,16 +453,6 @@ def update_requirements(initial_installation=False, pull=True):
if not initial_installation and not wheels_changed:
textgen_requirements = [line for line in textgen_requirements if '.whl' not in line]
if "+cu118" in torver:
textgen_requirements = [
req.replace('+cu121', '+cu118').replace('+cu122', '+cu118')
for req in textgen_requirements
if "autoawq" not in req.lower()
]
if is_windows() and "+cu118" in torver: # No flash-attention on Windows for CUDA 11
textgen_requirements = [req for req in textgen_requirements if 'oobabooga/flash-attention' not in req]
with open('temp_requirements.txt', 'w') as file:
file.write('\n'.join(textgen_requirements))


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
bitsandbytes==0.45.*
colorama
datasets
@ -10,7 +10,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -21,7 +21,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -33,29 +33,21 @@ tiktoken
# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
# llama-cpp-python (CUDA, with GGML_CUDA_FORCE_MMQ)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
# llama-cpp-python (CUDA, without GGML_CUDA_FORCE_MMQ)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
# CUDA wheels
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,13 +32,9 @@ tiktoken
# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
# AMD wheels
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.8+rocm6.1.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.8+rocm6.1.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,11 +32,8 @@ tiktoken
# llama-cpp-python (CPU only, no AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
# AMD wheels
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,7 +32,6 @@ tiktoken
# Mac wheels
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1-py3-none-any.whl
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,9 +32,7 @@ tiktoken
# Mac wheels
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1-py3-none-any.whl
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,6 +32,4 @@ tiktoken
# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -32,6 +32,4 @@ tiktoken
# llama-cpp-python (CPU only, no AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
bitsandbytes==0.45.*
colorama
datasets
@ -10,7 +10,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -21,7 +21,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb
@ -33,29 +33,21 @@ tiktoken
# llama-cpp-python (CPU only, no AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
# llama-cpp-python (CUDA, with GGML_CUDA_FORCE_MMQ)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
# llama-cpp-python (CUDA, without GGML_CUDA_FORCE_MMQ)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
# CUDA wheels
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"


@ -1,4 +1,4 @@
accelerate==1.4.*
accelerate==1.5.*
colorama
datasets
einops
@ -9,7 +9,7 @@ markdown
numba==0.59.*
numpy==1.26.*
pandas
peft==0.12.*
peft==0.15.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
@ -20,7 +20,7 @@ safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.49.*
transformers==4.50.*
tqdm
wandb


@ -218,28 +218,10 @@ if __name__ == "__main__":
if extension not in shared.args.extensions:
shared.args.extensions.append(extension)
available_models = utils.get_available_models()
# Model defined through --model
if shared.args.model is not None:
shared.model_name = shared.args.model
# Select the model from a command-line menu
elif shared.args.model_menu:
if len(available_models) == 0:
logger.error('No models are available! Please download at least one.')
sys.exit(0)
else:
print('The following models are available:\n')
for i, model in enumerate(available_models):
print(f'{i+1}. {model}')
print(f'\nWhich one do you want to load? 1-{len(available_models)}\n')
i = int(input()) - 1
print()
shared.model_name = available_models[i]
# If any model has been selected, load it
if shared.model_name != 'None':
p = Path(shared.model_name)


@ -25,7 +25,7 @@ add_bos_token: true
skip_special_tokens: true
stream: true
static_cache: false
truncation_length: 2048
truncation_length: 8192
seed: -1
custom_stopping_strings: ''
custom_token_bans: ''