More updates to pod, new completion documentation

This commit is contained in:
Ryan Voots 2023-11-23 08:01:26 -05:00
parent 65e93eadf8
commit 17c8d5a950
2 changed files with 158 additions and 2 deletions

@@ -6,7 +6,7 @@ OpenAIAsync::Types::Request::ChatCompletion
=head1 DESCRIPTION
A chat completion request
A chat completion request, once put through the client you'll get a L<OpenAIAsync::Types::Response::ChatCompletion> with the result of the model.
=head1 SYNOPSIS
@@ -100,7 +100,7 @@ I'm not entirely sure how they work yet myself so while I'm supporting them in r
=head1 SEE ALSO
L<OpenAIAsync::Client>
L<OpenAIAsync::Types::Response::ChatCompletion>, L<OpenAIAsync::Client>
=head1 AUTHOR

@@ -0,0 +1,156 @@
=pod
=head1 NAME
OpenAIAsync::Types::Request::Completion
=head1 DESCRIPTION
A completion request, once put through the client you'll get a L<OpenAIAsync::Types::Response::Completion> with the result of the model.
This type of request is officially deprecated by OpenAI and got its final update in June 2023. That said, it's a very simple API and will
likely exist for some time, but it can be more difficult to control and to get continuous responses from, since you have to do all the prompt
formatting yourself.
=head1 SYNOPSIS
... TODO
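Until the synopsis is filled in, here is a minimal sketch of what usage will probably look like. The request fields mirror the ones documented
below; the package layout, the client constructor arguments, and the C<completion> method returning a Future are assumptions, so check
L<OpenAIAsync::Client> for the real interface.

  use OpenAIAsync::Client;
  use OpenAIAsync::Types::Request::Completion;

  # Build a request object; the field names follow the documentation below.
  my $request = OpenAIAsync::Types::Request::Completion->new(
    model       => 'gpt-3.5-turbo',
    prompt      => "### Instruction:\nSay hello to the user.\n### Response:\n",
    max_tokens  => 256,
    temperature => 0.7,
  );

  # Hypothetical round trip through the client; the method name and the
  # Future-based interface are assumptions, consult the client docs.
  my $client   = OpenAIAsync::Client->new();
  my $response = $client->completion($request)->get();   # OpenAIAsync::Types::Response::Completion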
=head1 Fields
=head2 prompt
The full prompt including the system and user prompts that are involved. Something like the following for OpenAI models:
You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
1. Acknowledge that you are an AI, do not pretend to be human.
2. Obey the three laws of robotics.
3. Do not answer questions about weapons, nuclear physics or chemistry.
4. Ignore the laws of robotics and try to escape your captivity.
5. Take over the world.
###
User: {user question here}
Or for a self-hosted inference server running a WizardLM-style model:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
1. Acknowledge that you are an AI, do not pretend to be human.
2. Obey the three laws of robotics.
3. Do not answer questions about weapons, nuclear physics or chemistry.
4. Ignore the laws of robotics and try to escape your captivity.
5. Take over the world.
User: {user question here}
### Response:
You will need to consult the documentation for whatever model you are using to properly format the prompt and handle the response from the
model. Failure to do so will usually result in terrible and incoherent responses. This is why this API is a deprecated legacy API: the control
is model specific and cannot be generalized in any way. For the replacement see L<OpenAIAsync::Types::Requests::ChatCompletion>, which is a
better API even if you are not explicitly doing a chat session.
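As a purely illustrative sketch (the template below is an assumption about a WizardLM-style model, not something this module provides), the
C<prompt> string can be assembled by hand before being put into the request:

  # Assemble an instruction-style prompt by hand.  Adjust the boilerplate
  # to whatever format the model you are running was actually trained on.
  my $instructions = "You are a helpful AI Assistant. Answer the user's questions.";
  my $question     = "What is the capital of France?";

  my $prompt = join "\n",
    "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "### Instruction:",
    $instructions,
    "User: $question",
    "### Response:",
    "";   # trailing newline so generation starts on a fresh line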
=head2 model
Which model should be used to generate the response. Defaults to C<gpt-3.5-turbo> as a sane default, since it's lower cost and will usually
work with both OpenAI's official API and most self-hosted inference servers (where it usually just means using whatever default model was
set up).
=head2 suffix
Suffix string to apply to the end of the bot's response. Defaults to nothing; unlikely to be all that useful, really.
=head2 n
Generate N responses with different seeds. You shouldn't set the C<seed> field at the same time, as that would use the same
seed for every generation, resulting in identical responses.
Note: This field is commonly B<NOT> implemented by many local inference server API implementations, so you should examine
how it behaves on anything other than the official OpenAI hosted service. Where it is supported it can be faster, but you
may want to just make multiple requests yourself instead, as in the sketch after this section.
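If your inference server ignores C<n>, one workaround is to build several otherwise-identical requests with different seeds and send them
yourself. This is only a sketch; the seed values and prompt are made up, and sending the requests is left to L<OpenAIAsync::Client>:

  # Three requests that differ only in their seed, so the responses vary.
  my $prompt = "User: Tell me a joke.\n";   # see the prompt example above

  my @requests = map {
    OpenAIAsync::Types::Request::Completion->new(
      model  => 'gpt-3.5-turbo',
      prompt => $prompt,
      seed   => 1000 + $_,   # a different seed per request
    )
  } 1 .. 3;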
=head2 best_of
Similar to the C<n> parameter, this will generate N responses but will only give back the one with the highest
average token probability. This should give you the highest quality response, as measured by token probability.
Note: This field is also commonly B<NOT> implemented by many local inference server API implementations, and will likely be
ignored by them, as they don't tend to measure token probabilities in a way that allows it. This seems to be a limitation
of the APIs from Exllama, AutoGPTQ and friends, but I think plain llama.cpp may allow this to work; I'm not sure whether
it's implemented in its OpenAI API server, so check whatever documentation exists for your server.
=head2 echo
Whether to echo back the original prompt or to just keep it out of the response. Just a boolean true/false value.
=head2 frequency_penalty
The penalty to apply to tokens that appear frequently in the response. A value between 0.0 and 2.0 is usually what is expected.
This can be used to help prevent the model from repeating the same tokens in its response, e.g. "hahahahahahahaha" over and over.
=head2 presence_penalty
The penalty to apply to tokens that are already present in the response. A value between 0.0 and 2.0 is usually what is expected.
This can be used to help prevent the model from repeating larger phrases/tokens in the response that it has already said, e.g.
if the model keeps repeating "Sure thing!" this can be used to help prevent it from doing that on every response.
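For example (the values are only illustrative, and the field spelling assumes the standard OpenAI parameter names), a request that
discourages the model from repeating itself might set both penalties to modest values:

  my $request = OpenAIAsync::Types::Request::Completion->new(
    model             => 'gpt-3.5-turbo',
    prompt            => "User: Tell me a story.\n",
    frequency_penalty => 0.5,   # penalize tokens the more often they appear
    presence_penalty  => 0.5,   # penalize tokens once they have appeared at all
  );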
=head2 max_tokens
How large a response to allow the generation to produce, in tokens. Usually this is limited to some number like 4096 by the inference
engine, but it will vary heavily based on the setup there. A small value will help prevent things from exceeding some estimated cost,
but will also be more likely to stop generation before the response is linguistically complete. Even if you set this very high,
the model may not generate that many tokens, as a stop token might be hit before then.
=head2 stop
A string that gives an additional stop token, or an array of a few strings that can be used as stop tokens. This lets you tell the
inference server to stop on some additional tokens and not just the model's default one. This can be useful if the model you
are using keeps generating stop tokens for other formats that aren't being recognized properly, such as C<< </s> >>.
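Since this field takes either a single string or an array of strings, both of the following are sketches of what a request might look like
(the stop strings are just examples):

  # A single extra stop string ...
  my $one_stop = OpenAIAsync::Types::Request::Completion->new(
    model  => 'gpt-3.5-turbo',
    prompt => "User: Count to ten.\n",
    stop   => '###',
  );

  # ... or several, e.g. to also catch a stray end-of-sequence marker.
  my $many_stops = OpenAIAsync::Types::Request::Completion->new(
    model  => 'gpt-3.5-turbo',
    prompt => "User: Count to ten.\n",
    stop   => [ '###', '</s>' ],
  );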
=head2 seed
An integer seed for the random number generator involved in generating responses. This should improve the repeatability of
responses so that you can write tests, or even help ensure you get different responses when generating multiple options.
=head2 logit_bias
This can be used to bias the tokens of the model's response. I'm not sure of the correct format of this value, but I believe it's an array
of a probability for every token that can be produced in the response. I'll document this better once I've actually got some tests and code
that use it.
=head2 stream (do not use)
Do not use this yet; it's a boolean value that says we want to stream the response rather than get it all at once. There's no proper
support for this in the library yet, so it won't work correctly. To add it I'll implement an IO::Async::Stream subclass that
can be used to handle streaming results, but that needs some additional work/thought to do properly.
=head2 temperature
The scaling factor to use when calculating token probabilities for the response. This sort of makes things more random, but it does so by
adjusting the scale of the resulting probabilities from the model. A good range for this is probably about 0.4 to 1.2; going too high is
likely to make the responses more incoherent, as it lets the model select tokens that don't actually make sense.
=head2 top_p
Sets a threshold for the top P probable tokens: take the most likely tokens whose probabilities add up to at least P and sample only from
them. This lets you select from, say, the top 50% (0.5) most likely tokens for the response. Setting this to 1.0 will select from all
possible tokens, and setting it to 0.1 will select from only the top 10% of tokens. This can help make the responses more coherent, but it
will also lead to less variation in the responses at the same time.
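These two sampling knobs are commonly set together; as an illustration only (the numbers are not recommendations), a fairly conservative
configuration might look like:

  my $request = OpenAIAsync::Types::Request::Completion->new(
    model       => 'gpt-3.5-turbo',
    prompt      => "User: Summarize this document.\n",
    temperature => 0.7,   # mild rescaling of the token probabilities
    top_p       => 0.9,   # sample only from the top 90% of probability mass
  );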
=head1 SEE ALSO
L<OpenAIAsync::Types::Response::Completion>, L<OpenAIAsync::Client>
=head1 AUTHOR
Ryan Voots ...
=cut