Commit graph

48 commits

Author SHA1 Message Date
oobabooga
7f7909be54 Merge branch 'dev' into multimodal_gguf 2025-06-05 10:52:06 -07:00
oobabooga
2db7745cbd Show llama.cpp prompt processing on one line instead of many lines 2025-06-01 22:12:24 -07:00
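A minimal sketch of the one-line reporting style this commit describes: overwrite the status line with a carriage return instead of appending a new line per update. The function name and token counts are illustrative, not the repo's actual code.

```python
import sys
import time

def report_progress(processed: int, total: int) -> None:
    # Overwrite the current status line in place instead of appending a new one
    sys.stdout.write(f"\rPrompt processing: {processed}/{total} tokens")
    sys.stdout.flush()
    if processed == total:
        sys.stdout.write("\n")  # finish the line once processing completes

# Simulate progress updates arriving from llama-server
for done in range(0, 101, 20):
    report_progress(done, 100)
    time.sleep(0.05)
```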
oobabooga
9d7894a13f Organize 2025-05-28 10:10:26 -07:00
oobabooga
f92e1f44a0 Add multimodal support (llama.cpp) 2025-05-28 05:52:07 -07:00
oobabooga
e4d3f4449d API: Fix a regression 2025-05-16 13:02:27 -07:00
oobabooga
5534d01da0 Estimate the VRAM for GGUF models + autoset gpu-layers (#6980) 2025-05-16 00:07:37 -03:00
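A hypothetical illustration of the idea behind #6980: estimate the per-layer cost of a GGUF model and pick the largest --gpu-layers value that still fits in free VRAM. The real heuristic reads GGUF metadata and accounts for the context size; every number and name below is made up for the sketch.

```python
def autoset_gpu_layers(model_bytes: int, n_layers: int, free_vram_bytes: int,
                       overhead_bytes: int = 512 * 1024 ** 2) -> int:
    per_layer = model_bytes / n_layers          # rough cost of offloading one layer
    budget = free_vram_bytes - overhead_bytes   # reserve room for KV cache, buffers
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))

# Example: an 8 GB model with 32 layers and 6 GB of free VRAM -> 22 layers
print(autoset_gpu_layers(8 * 1024 ** 3, 32, 6 * 1024 ** 3))
```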
oobabooga
62c774bf24 Revert "New attempt" 2025-05-13 06:42:25 -07:00
    This reverts commit e7ac06c169.
oobabooga
e7ac06c169 New attempt 2025-05-10 19:20:04 -07:00
oobabooga
9ea2a69210 llama.cpp: Add --no-webui to the llama-server command 2025-05-08 10:41:25 -07:00
oobabooga
c4f36db0d8 llama.cpp: remove tfs (it doesn't get used) 2025-05-06 08:41:13 -07:00
oobabooga
d1c0154d66 llama.cpp: Add top_n_sigma, fix typical_p in sampler priority 2025-05-06 06:38:39 -07:00
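A stand-in sketch of what "sampler priority" means here: samplers run in the order the priority list gives, so registering top_n_sigma and placing typical_p at the right position changes which filter sees the distribution first. The transformations below are placeholders, not real sampler math.

```python
samplers = {
    "temperature": lambda logits: [l / 0.8 for l in logits],  # placeholder scaling
    "top_n_sigma": lambda logits: logits,                     # placeholder filter
    "typical_p":   lambda logits: logits,                     # placeholder filter
}
sampler_priority = ["top_n_sigma", "temperature", "typical_p"]

logits = [1.0, 2.0, 3.0]
for name in sampler_priority:       # apply in priority order
    logits = samplers[name](logits)
print(logits)
```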
oobabooga
b817bb33fd Minor fix after df7bb0db1f 2025-05-05 05:00:20 -07:00
oobabooga
b7a5c7db8d llama.cpp: Handle short arguments in --extra-flags 2025-05-04 07:14:42 -07:00
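A sketch of the parsing fix, assuming --extra-flags takes a comma-separated list of flag or flag=value items (the syntax and helper name are assumptions): single-character names get one dash, everything else gets two.

```python
def parse_extra_flags(extra_flags: str) -> list[str]:
    args = []
    for item in extra_flags.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            name, value = item.split("=", 1)
        else:
            name, value = item, None
        prefix = "-" if len(name) == 1 else "--"  # short arguments get one dash
        args.append(f"{prefix}{name}")
        if value is not None:
            args.append(value)
    return args

print(parse_extra_flags("fa,ctk=q8_0,c=4096"))
# ['--fa', '--ctk', 'q8_0', '-c', '4096']
```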
oobabooga
4c2e3b168b llama.cpp: Add a retry mechanism when getting the logits (sometimes it fails) 2025-05-03 06:51:20 -07:00
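The commit message states the design outright: retry when fetching logits because the call occasionally fails. A generic version of such a retry loop (URL, payload shape, and attempt counts are assumptions):

```python
import time
import requests

def get_logits_with_retry(url: str, payload: dict, attempts: int = 5,
                          delay: float = 0.5) -> dict:
    last_error = None
    for _ in range(attempts):
        try:
            response = requests.post(url, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except (requests.RequestException, ValueError) as error:
            last_error = error  # transient failure: wait and try again
            time.sleep(delay)
    raise RuntimeError(f"could not get logits after {attempts} attempts") from last_error
```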
oobabooga
b950a0c6db Lint 2025-04-30 20:02:10 -07:00
oobabooga
a6c3ec2299 llama.cpp: Explicitly send cache_prompt = True 2025-04-30 15:24:07 -07:00
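cache_prompt is a real llama-server parameter: it reuses the KV cache for the shared prefix of consecutive prompts, so only the new tokens get reprocessed. Sending it explicitly, as this commit does, avoids depending on the server's default. A minimal request, assuming a local server on the default port:

```python
import requests

payload = {
    "prompt": "Hello, my name is",
    "n_predict": 32,
    "cache_prompt": True,  # reuse the KV cache for the common prompt prefix
}
response = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
print(response.json()["content"])
```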
oobabooga
1ee0acc852 llama.cpp: Make --verbose print the llama-server command 2025-04-28 15:56:25 -07:00
oobabooga
c6c2855c80 llama.cpp: Remove the timeout while loading models (closes #6907) 2025-04-27 21:22:21 -07:00
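Large models can take minutes to load, so a fixed HTTP timeout aborts them spuriously (the #6907 report). One hedged sketch of the alternative: poll llama-server's /health endpoint with no overall deadline, keeping only a short per-request timeout.

```python
import time
import requests

def wait_for_server(base_url: str, poll_interval: float = 1.0) -> None:
    while True:  # no overall deadline: loading may legitimately take minutes
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return  # model finished loading
        except requests.RequestException:
            pass        # server not up yet
        time.sleep(poll_interval)
```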
oobabooga
7b80acd524 Fix parsing --extra-flags 2025-04-26 18:40:03 -07:00
oobabooga
234aba1c50 llama.cpp: Simplify the prompt processing progress indicator 2025-04-26 17:33:47 -07:00
    The progress bar was unreliable
oobabooga
d4b1e31c49 Use --ctx-size to specify the context size for all loaders 2025-04-25 16:59:03 -07:00
    Old flags are still recognized as alternatives.
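The "old flags are still recognized" part maps naturally onto argparse aliases: several option strings sharing one destination. The alias names below stand in for the previous per-loader flags and are assumptions.

```python
import argparse

parser = argparse.ArgumentParser()
# One canonical flag; the old per-loader names stay usable as aliases.
parser.add_argument("--ctx-size", "--n_ctx", "--max_seq_len",
                    dest="ctx_size", type=int, default=8192)

print(parser.parse_args(["--n_ctx", "4096"]).ctx_size)  # 4096
```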
oobabooga
faababc4ea llama.cpp: Add a prompt processing progress bar 2025-04-25 16:42:30 -07:00
oobabooga
877cf44c08 llama.cpp: Add StreamingLLM (--streaming-llm) 2025-04-25 16:21:41 -07:00
oobabooga
98f4c694b9 llama.cpp: Add --extra-flags parameter for passing additional flags to llama-server 2025-04-25 07:32:51 -07:00
oobabooga
e99c20bcb0 llama.cpp: Add speculative decoding (#6891) 2025-04-23 20:10:16 -03:00
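In speculative decoding, a small draft model proposes several tokens per step and the main model verifies them in one pass, speeding up generation without changing the output distribution. With a llama-server backend this reduces to extra launch flags; the flag names below match current llama.cpp builds but may vary by version, and the file names are placeholders.

```python
import subprocess

cmd = [
    "llama-server",
    "--model", "main-model.gguf",
    "--model-draft", "draft-model.gguf",  # small, fast model for proposals
    "--draft-max", "16",                  # max tokens drafted per step
    "--port", "8080",
]
server = subprocess.Popen(cmd)  # the webui would then talk to it over HTTP
```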
Matthew Jenkins
d3e7c655e5 Add support for llama-cpp builds from https://github.com/ggml-org/llama.cpp (#6862) 2025-04-20 23:06:24 -03:00
oobabooga
5ab069786b llama.cpp: add back the two encode calls (they are harmless now) 2025-04-19 17:38:36 -07:00
oobabooga
b9da5c7e3a Use 127.0.0.1 instead of localhost for faster llama.cpp on Windows 2025-04-19 17:36:04 -07:00
oobabooga
9c9df2063f llama.cpp: fix unicode decoding (closes #6856) 2025-04-19 16:38:15 -07:00
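The usual cause of bugs like #6856: a streamed UTF-8 response can split a multi-byte character across chunks, and decoding each chunk independently then fails or drops bytes. An incremental decoder buffers the incomplete tail instead; a self-contained demonstration (not the repo's code):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")

data = "héllo".encode()
chunks = [data[:2], data[2:]]  # the split lands inside the two-byte 'é'
text = "".join(decoder.decode(chunk) for chunk in chunks)
print(text)  # héllo
```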
oobabooga
ba976d1390 llama.cpp: avoid two 'encode' calls 2025-04-19 16:35:01 -07:00
oobabooga
ed42154c78 Revert "llama.cpp: close the connection immediately on 'Stop'" 2025-04-19 05:32:36 -07:00
    This reverts commit 5fdebc554b.
oobabooga
5fdebc554b llama.cpp: close the connection immediately on 'Stop' 2025-04-19 04:59:24 -07:00
oobabooga
6589ebeca8 Revert "llama.cpp: new optimization attempt" 2025-04-18 21:16:21 -07:00
    This reverts commit e2e73ed22f.
oobabooga
e2e73ed22f llama.cpp: new optimization attempt 2025-04-18 21:05:08 -07:00
oobabooga
e2e90af6cd llama.cpp: don't include --rope-freq-base in the launch command if null 2025-04-18 20:51:18 -07:00
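The same pattern as a sketch: only emit the flag when the user set a value, so llama-server falls back to whatever rope-freq-base is stored in the GGUF metadata. The helper name is hypothetical.

```python
def rope_args(rope_freq_base: float | None) -> list[str]:
    if rope_freq_base is None:
        return []  # omit the flag; let the GGUF metadata decide
    return ["--rope-freq-base", str(rope_freq_base)]

print(rope_args(None))       # []
print(rope_args(1000000.0))  # ['--rope-freq-base', '1000000.0']
```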
oobabooga
9f07a1f5d7 llama.cpp: new attempt at optimizing the llama-server connection 2025-04-18 19:30:53 -07:00
oobabooga
f727b4a2cc llama.cpp: close the connection properly when generation is cancelled 2025-04-18 19:01:39 -07:00
oobabooga
b3342b8dd8 llama.cpp: optimize the llama-server connection 2025-04-18 18:46:36 -07:00
oobabooga
2002590536 Revert "Attempt at making the llama-server streaming more efficient." 2025-04-18 18:13:54 -07:00
    This reverts commit 5ad080ff25.
oobabooga
71ae05e0a4 llama.cpp: Fix the sampler priority handling 2025-04-18 18:06:36 -07:00
oobabooga
5ad080ff25 Attempt at making the llama-server streaming more efficient. 2025-04-18 18:04:49 -07:00
oobabooga
4fabd729c9 Fix the API without streaming or without 'sampler_priority' (closes #6851) 2025-04-18 17:25:22 -07:00
oobabooga
5135523429 Fix the new llama.cpp loader failing to unload models 2025-04-18 17:10:26 -07:00
oobabooga
caa6afc88b Only show 'GENERATE_PARAMS=...' in the logits endpoint if use_logits is True 2025-04-18 09:57:57 -07:00
oobabooga
d00d713ace Rename get_max_context_length to get_vocabulary_size in the new llama.cpp loader 2025-04-18 08:14:15 -07:00
oobabooga
c1cc65e82e Lint 2025-04-18 08:06:51 -07:00
oobabooga
a0abf93425 Connect --rope-freq-base to the new llama.cpp loader 2025-04-18 06:53:51 -07:00
oobabooga
ae54d8faaa New llama.cpp loader (#6846) 2025-04-18 09:59:37 -03:00
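This PR is where the history starts: instead of in-process bindings, the loader launches llama-server as a subprocess and streams tokens over its HTTP API, which is what most of the commits above refine. A rough sketch of that shape (model path, port, and payload fields are assumptions; llama-server streams server-sent "data:" lines):

```python
import json
import subprocess
import requests

proc = subprocess.Popen(["llama-server", "--model", "model.gguf",
                         "--port", "8080", "--gpu-layers", "32"])

def stream_completion(prompt: str):
    payload = {"prompt": prompt, "n_predict": 64, "stream": True}
    with requests.post("http://127.0.0.1:8080/completion", json=payload,
                       stream=True, timeout=None) as response:
        for line in response.iter_lines():
            if line.startswith(b"data: "):
                chunk = json.loads(line[len(b"data: "):])
                yield chunk.get("content", "")
```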