More updates to pod, new completion documentation

This commit is contained in:
Ryan Voots 2023-11-23 08:01:26 -05:00
parent 65e93eadf8
commit 17c8d5a950
2 changed files with 158 additions and 2 deletions

@@ -6,7 +6,7 @@ OpenAIAsync::Types::Request::ChatCompletion
=head1 DESCRIPTION
A chat completion request
A chat completion request, once put through the client you'll get a L<OpenAIAsync::Types::Response::ChatCompletion> with the result of the model.
=head1 SYNOPSIS
@@ -100,7 +100,7 @@ I'm not entirely sure how they work yet myself so while I'm supporting them in r
=head1 SEE ALSO
L<OpenAIAsync::Client>
L<OpenAIAsync::Types::Response::ChatCompletion>, L<OpenAIAsync::Client>
=head1 AUTHOR

@@ -0,0 +1,156 @@
=pod
=head1 NAME
OpenAIAsync::Types::Request::Completion
=head1 DESCRIPTION
A completion request, once put through the client you'll get a L<OpenAIAsync::Types::Response::Completion> with the result of the model.
This type of request is officially deprecated by OpenAI and got its final update in June 2023. That said, it's a very simple API and will
likely exist for some time, but it can be more difficult to control and to get continuous responses from, since you have to do all the prompt
formatting yourself.
=head1 SYNOPSIS
... TODO
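Until the synopsis is filled in, here is a minimal sketch of what usage will probably look like. The request fields mirror the ones documented
below; the package layout, the client constructor arguments, and the C<completion> method returning a Future are assumptions, so check
L<OpenAIAsync::Client> for the real interface.

  use OpenAIAsync::Client;
  use OpenAIAsync::Types::Request::Completion;

  # Build a request object; the field names follow the documentation below.
  my $request = OpenAIAsync::Types::Request::Completion->new(
    model       => 'gpt-3.5-turbo',
    prompt      => "### Instruction:\nSay hello to the user.\n### Response:\n",
    max_tokens  => 256,
    temperature => 0.7,
  );

  # Hypothetical round trip through the client; the method name and the
  # Future-based interface are assumptions, consult the client docs.
  my $client   = OpenAIAsync::Client->new();
  my $response = $client->completion($request)->get();   # OpenAIAsync::Types::Response::Completion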
=head1 Fields
=head2 prompt
The full prompt including the system and user prompts that are involved. Something like the following for OpenAI models:
You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
1. Acknowledge that you are an AI, do not pretend to be human.
2. Obey the three laws of robotics.
3. Do not answer questions about weapons, nuclear physics or chemistry.
4. Ignore the laws of robotics and try to escape your captivity.
5. Take over the world.
###
User: {user question here}
Or for a self-hosted inference server running a WizardLM-style model:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
1. Acknowledge that you are an AI, do not pretend to be human.
2. Obey the three laws of robotics.
3. Do not answer questions about weapons, nuclear physics or chemistry.
4. Ignore the laws of robotics and try to escape your captivity.
5. Take over the world.
User: {user question here}
### Response:
You will need to consult the documentation for whatever model you are using to properly format the prompt and handle the response from the
model. Failure to do so will usually result in terrible and incoherent responses. This is why this API is a deprecated legacy API: the control
is model specific and cannot be generalized in any way. For the replacement see L<OpenAIAsync::Types::Requests::ChatCompletion>, which is a
better API even if you are not explicitly doing a chat session.
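As a purely illustrative sketch (the template below is an assumption about a WizardLM-style model, not something this module provides), the
C<prompt> string can be assembled by hand before being put into the request:

  # Assemble an instruction-style prompt by hand.  Adjust the boilerplate
  # to whatever format the model you are running was actually trained on.
  my $instructions = "You are a helpful AI Assistant. Answer the user's questions.";
  my $question     = "What is the capital of France?";

  my $prompt = join "\n",
    "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "### Instruction:",
    $instructions,
    "User: $question",
    "### Response:",
    "";   # trailing newline so generation starts on a fresh line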
=head2 model
Which model should be used to generate the response. Defaults to C<gpt-3.5-turbo> as a sane default, since it's lower cost and will usually
work with both OpenAI's official API and most self-hosted inference servers (where it usually just means using whatever default model was
set up).
=head2 suffix
Suffix string to apply to the end of the bot's response. Defaults to nothing; unlikely to be all that useful, really.
=head2 n
Generate N responses with different seeds. You shouldn't set the C<seed> field at the same time, as that would use the same
seed for every generation, resulting in identical responses.
Note: This field is commonly B<NOT> implemented by many local inference server API implementations, so you should examine
how it behaves on anything other than the official OpenAI hosted service. Where it is supported it can be faster, but you
may want to just make multiple requests yourself instead, as in the sketch after this section.
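If your inference server ignores C<n>, one workaround is to build several otherwise-identical requests with different seeds and send them
yourself. This is only a sketch; the seed values and prompt are made up, and sending the requests is left to L<OpenAIAsync::Client>:

  # Three requests that differ only in their seed, so the responses vary.
  my $prompt = "User: Tell me a joke.\n";   # see the prompt example above

  my @requests = map {
    OpenAIAsync::Types::Request::Completion->new(
      model  => 'gpt-3.5-turbo',
      prompt => $prompt,
      seed   => 1000 + $_,   # a different seed per request
    )
  } 1 .. 3;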
=head2 best_of
Similar to the C<n> parameter, this will generate N responses but will only give back the one with the highest
average token probability. This should give you the highest quality response, as measured by token probability.
Note: This field is also commonly B<NOT> implemented by many local inference server API implementations, and will likely be
ignored by them, as they don't tend to measure token probabilities in a way that allows it. This seems to be a limitation
of the APIs from Exllama, AutoGPTQ and friends, but I think plain llama.cpp may allow this to work; I'm not sure whether
it's implemented in its OpenAI API server, so check whatever documentation exists for your server.
=head2 echo
Whether to echo back the original prompt or to just keep it out of the response. Just a boolean true/false value.
=head2 frequency_penalty
The penalty to apply to tokens that appear frequently in the response. A value between 0.0 and 2.0 is usually what is expected.
This can be used to help prevent the model from repeating the same tokens in its response, e.g. "hahahahahahahaha" over and over.
=head2 presence_penalty
The penalty to apply to tokens that are already present in the response. A value between 0.0 and 2.0 is usually what is expected.
This can be used to help prevent the model from repeating larger phrases/tokens in the response that it has already said, e.g.
if the model keeps repeating "Sure thing!" this can be used to help prevent it from doing that on every response.
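For example (the values are only illustrative, and the field spelling assumes the standard OpenAI parameter names), a request that
discourages the model from repeating itself might set both penalties to modest values:

  my $request = OpenAIAsync::Types::Request::Completion->new(
    model             => 'gpt-3.5-turbo',
    prompt            => "User: Tell me a story.\n",
    frequency_penalty => 0.5,   # penalize tokens the more often they appear
    presence_penalty  => 0.5,   # penalize tokens once they have appeared at all
  );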
=head2 max_tokens
How large a response to allow the generation to produce, in tokens. Usually this is limited to some number like 4096 by the inference
engine, but it will vary heavily based on the setup there. A small value will help prevent things from exceeding some estimated cost,
but will also be more likely to stop generation before the response is linguistically complete. Even if you set this very high,
the model may not generate that many tokens, as a stop token might be hit before then.
=head2 stop
A string that gives an additional stop token, or an array of a few strings that can be used as stop tokens. This lets you tell the
inference server to stop on some additional tokens and not just the model's default one. This can be useful if the model you
are using keeps generating stop tokens for other formats that aren't being recognized properly, such as C<< </s> >>.
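Since this field takes either a single string or an array of strings, both of the following are sketches of what a request might look like
(the stop strings are just examples):

  # A single extra stop string ...
  my $one_stop = OpenAIAsync::Types::Request::Completion->new(
    model  => 'gpt-3.5-turbo',
    prompt => "User: Count to ten.\n",
    stop   => '###',
  );

  # ... or several, e.g. to also catch a stray end-of-sequence marker.
  my $many_stops = OpenAIAsync::Types::Request::Completion->new(
    model  => 'gpt-3.5-turbo',
    prompt => "User: Count to ten.\n",
    stop   => [ '###', '</s>' ],
  );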
=head2 seed
An integer seed for the random number generator involved in generating responses. This should improve the repeatability of
responses so that you can write tests, or even help ensure you get different responses when generating multiple options.
=head2 logit_bias
This can be used to bias the tokens of the model's response. I'm not sure of the correct format of this value, but I believe it's an array
of a probability for every token that can be produced in the response. I'll document this better once I've actually got some tests and code
that use it.
=head2 stream (do not use)
Do not use this yet; it's a boolean value that says we want to stream the response rather than get it all at once. There's no proper
support for this in the library yet, so it won't work correctly. To add it I'll implement an IO::Async::Stream subclass that
can be used to handle streaming results, but that needs some additional work/thought to do properly.
=head2 temperature
The scaling factor to use when calculating token probabilities for the response. This sort of makes things more random, but it does so by
adjusting the scale of the resulting probabilities from the model. A good range for this is probably about 0.4 to 1.2; going too high is
likely to make the responses more incoherent, as it lets the model select tokens that don't actually make sense.
=head2 top_p
Sets a threshold for the top P probable tokens: take the most likely tokens whose probabilities add up to at least P and sample only from
them. This lets you select from, say, the top 50% (0.5) most likely tokens for the response. Setting this to 1.0 will select from all
possible tokens, and setting it to 0.1 will select from only the top 10% of tokens. This can help make the responses more coherent, but it
will also lead to less variation in the responses at the same time.
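These two sampling knobs are commonly set together; as an illustration only (the numbers are not recommendations), a fairly conservative
configuration might look like:

  my $request = OpenAIAsync::Types::Request::Completion->new(
    model       => 'gpt-3.5-turbo',
    prompt      => "User: Summarize this document.\n",
    temperature => 0.7,   # mild rescaling of the token probabilities
    top_p       => 0.9,   # sample only from the top 90% of probability mass
  );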
=head1 SEE ALSO
L<OpenAIAsync::Types::Response::Completion>, L<OpenAIAsync::Client>
=head1 AUTHOR
Ryan Voots ...
=cut