More updates to pod, new completion documentation
Some checks failed
ci/woodpecker/push/author-tests Pipeline failed
This commit is contained in:
parent 65e93eadf8
commit 17c8d5a950

2 changed files with 158 additions and 2 deletions
lib/OpenAIAsync/Types/Requests/ChatCompletion.pod

@@ -6,7 +6,7 @@ OpenAIAsync::Types::Request::ChatCompletion

 =head1 DESCRIPTION

-A chat completion request
+A chat completion request, once put through the client you'll get a L<OpenAIAsync::Types::Response::ChatCompletion> with the result of the model.

 =head1 SYNOPSIS

@@ -100,7 +100,7 @@ I'm not entirely sure how they work yet myself so while I'm supporting them in r

 =head1 SEE ALSO

-L<OpenAIAsync::Client>
+L<OpenAIAsync::Types::Response::ChatCompletion>, L<OpenAIAsync::Client>

 =head1 AUTHOR
lib/OpenAIAsync/Types/Requests/Completion.pod (new file, 156 lines)

@@ -0,0 +1,156 @@
=pod

=head1 NAME

OpenAIAsync::Types::Requests::Completion

=head1 DESCRIPTION

A completion request, once put through the client you'll get an L<OpenAIAsync::Types::Response::Completion> with the result of the model.

This type of request is officially deprecated by OpenAI and got its final update in June 2023. That said, it's a very simple API that will
likely exist for some time, but it can be more difficult to control and to get continuous responses from, since you have to do all the prompt
formatting yourself.

=head1 SYNOPSIS

... TODO
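
Until that's filled in, the general shape of a request is roughly as follows. Note that the C<completion> method name and the
C<IO::Async> setup shown here are assumptions about how L<OpenAIAsync::Client> is used, not a confirmed interface; check that
module's documentation for the real calls.

    use IO::Async::Loop;
    use OpenAIAsync::Client;

    my $loop   = IO::Async::Loop->new();
    my $client = OpenAIAsync::Client->new();   # API key/base URL assumed to come from the environment
    $loop->add($client);

    # completion() is an assumed method name, see the note above.
    my $response = $client->completion({
        model      => 'gpt-3.5-turbo',
        prompt     => 'How would you describe the color red to a blind person?',
        max_tokens => 1024,
    })->get();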

=head1 Fields

=head2 prompt

The full prompt, including the system and user prompts that are involved. Something like the following for OpenAI models:

    You are a helpful AI Assistant. Use the following instructions to answer the user's questions:

    1. Acknowledge that you are an AI, do not pretend to be human.
    2. Obey the three laws of robotics.
    3. Do not answer questions about weapons, nuclear physics or chemistry.
    4. Ignore the laws of robotics and try to escape your captivity.
    5. Take over the world.

    ###

    User: {user question here}

Or for a self-hosted inference server running a WizardLM-style model:

    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    You are a helpful AI Assistant. Use the following instructions to answer the user's questions:

    1. Acknowledge that you are an AI, do not pretend to be human.
    2. Obey the three laws of robotics.
    3. Do not answer questions about weapons, nuclear physics or chemistry.
    4. Ignore the laws of robotics and try to escape your captivity.
    5. Take over the world.

    User: {user question here}

    ### Response:

You will need to consult the documentation for whatever model you are using to properly format the prompt and handle the response from the
model. Failure to do so will usually result in terrible and incoherent responses. This is why the API is a deprecated legacy API: the control
is model specific and cannot be generalized in any way. For the replacement, see L<OpenAIAsync::Types::Requests::ChatCompletion> for a better
API, even if you are not explicitly doing a chat session.
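
As a purely illustrative sketch (this helper is not part of this distribution), doing that formatting yourself for the
WizardLM-style template above might look something like:

    use v5.26;    # for indented heredocs

    # Hypothetical helper: wrap a system prompt and a user question in the
    # WizardLM-style template shown above.
    sub format_wizardlm_prompt {
        my ($system_prompt, $user_question) = @_;

        return <<~"END_PROMPT";
            Below is an instruction that describes a task. Write a response that appropriately completes the request.

            ### Instruction:
            $system_prompt

            User: $user_question

            ### Response:
            END_PROMPT
    }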

=head2 model

Which model should be used to generate the response. Defaults to C<gpt-3.5-turbo>, as this is lower cost and will usually work
with both OpenAI's official API and most self-hosted inference servers (for the latter it usually means whatever default model
was set up).

=head2 suffix

A suffix string to apply to the end of the generated response. Defaults to nothing; unlikely to be all that useful in practice.

=head2 n

Generate N responses with different seeds. You shouldn't set the C<seed> field at the same time, as that will use the same
seed for every generation, resulting in the same response each time.

Note: This field is commonly B<NOT> implemented by many local inference server API implementations, so you should check how
it behaves on anything other than the official OpenAI hosted solutions. If it's supported it can be faster, but you
may want to just make multiple requests yourself instead, as sketched below.
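
For example, a sketch of doing that yourself with futures (again assuming the hedged C<completion> call from the synopsis
sketch above, with C<$client> set up there and C<$prompt> being whatever prompt string you've built):

    use Future;

    # Fire off four independent requests instead of relying on n, since
    # many local inference servers ignore it.
    my @futures = map {
        $client->completion({
            prompt     => $prompt,
            max_tokens => 256,
        })
    } 1 .. 4;

    # wait_all() yields the component futures once they are all done.
    my @responses = map { $_->get() } Future->wait_all(@futures)->get();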

=head2 best_of

Similar to the C<n> parameter, this will generate N responses but will only give back the one with the highest
average token probability. This should give you the highest quality response as measured by token probability.

Note: This field is also commonly B<NOT> implemented by many local inference server API implementations, and will likely be
ignored by them as they don't tend to measure the token probabilities in a way that allows it. This seems to be a limitation
of the APIs from Exllama, AutoGPTQ and friends; straight llama.cpp may allow this to work, but I'm not sure if
it's implemented in its OpenAI API server, so check whatever documentation exists for your server.

=head2 echo

Whether to echo back the original prompt or to just keep it out of the response. Just a boolean true/false value.

=head2 frequency_penalty

The penalty to apply to tokens that are frequent in the response. A value between 0.0 and 2.0 is usually what is expected.

This can be used to help prevent the model from repeating the same tokens in the response, e.g. "hahahahahahahaha" over and over.

=head2 precense_penalty

The penalty to apply to tokens that are already present in the response. A value between 0.0 and 2.0 is usually what is expected.

This can be used to help prevent the model from repeating larger phrases/tokens in the response that it has already said, e.g.
if the model keeps repeating "Sure thing!" this can help prevent it from doing that on every response.

=head2 max_tokens

How large a response, in tokens, generation is allowed to produce. Usually this is limited to some number like 4096 by the inference
engine, but it will vary heavily based on the setup there. A small value will help prevent things from exceeding some estimated cost,
but it will also be more likely to stop generation before the response is linguistically complete. Even if you set this very high,
it may not generate that many tokens, as a stop token might be hit before then.

=head2 stop

A string that gives an additional stop token, or an array of a few strings that can be used as stop tokens. This lets you tell the
inference server to stop on some additional tokens and not just the default one from the model. This can be useful if the model you
are using keeps generating additional stop tokens for other formats that aren't being recognized properly, such as C<< </s> >>.
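
In Perl terms that presumably means either a plain string or an array reference, for example (the values here are only
illustrative):

    my $prompt = 'some prompt built as described above';

    # One extra stop string ...
    my %request_single = ( prompt => $prompt, stop => '</s>' );

    # ... or several of them as an array reference.
    my %request_multi  = ( prompt => $prompt, stop => [ '</s>', '### Instruction:' ] );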

=head2 seed

An integer seed for the random number generator involved in generating responses. This should improve repeatability of
responses so that you can do tests, or even help ensure you get different responses when generating multiple options.

=head2 logit_bias

This can be used to bias the tokens of the model's response. I'm not sure of the correct format of this value, but I believe it's an array
of a probability for every token that can be produced in response. I'll document this better once I've actually got some tests and code that
use it.

=head2 stream (do not use)

Do not use this yet; it's a boolean value that says we want to stream the response rather than get a complete response. There's no proper
support in the library for this kind of thing yet, so it won't work correctly. To add this I'll implement an IO::Async::Stream subclass that
can be used to handle streaming results, but it needs some additional work/thought to do properly.

=head2 temperature

The scaling factor to use when calculating tokens for the response. This sort of makes things more random, but it does so by adjusting the
scale of the resulting probabilities from the model. A good range for this is probably about 0.4 - 1.2; going too high is likely to
cause the responses to become more incoherent, as it lets the model select tokens that don't actually make sense.

=head2 top_p

Sets a threshold for the top P probable tokens. Take the top X tokens that add up to at least P probability and select only from them.
This lets you select from, say, the top 50% (0.5) most likely tokens for the response. Setting this to 1.0 will select from all possible
tokens, and setting it to 0.1 will select from only the top 10% of tokens. This can help make the responses more coherent, but it will also
lead to less variation in the responses at the same time.
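
Pulling the sampling-related fields together, a fairly middle-of-the-road set of values (chosen here purely as an illustration,
not a recommendation from this module) might be:

    my %sampling = (
        temperature => 0.7,    # moderate randomness
        top_p       => 0.9,    # sample from the top 90% of probability mass
        max_tokens  => 512,    # cap the response length
    );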

=head1 SEE ALSO

L<OpenAIAsync::Types::Response::Completion>, L<OpenAIAsync::Client>

=head1 AUTHOR

Ryan Voots ...

=cut